Computer Science, 2022, Vol. 49, Issue (1): 121-132. doi: 10.11896/jsjkx.201100148

• Database & Big Data & Data Science •

Study on Density Parameter and Center-Replacement Combined K-means and New Clustering Validity Index

ZHANG Ya-di, SUN Yue, LIU Feng, ZHU Er-zhou   

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Received: 2020-11-23 Revised: 2021-04-19 Online: 2022-01-15 Published: 2022-01-18
  • About author: ZHANG Ya-di, born in 1996, postgraduate. Her main research interests include cluster analysis and machine learning.
    ZHU Er-zhou, born in 1981, Ph.D., associate professor, postgraduate supervisor. His main research interests include virtualization, program analysis, data mining, and information security.
  • Supported by:
    Natural Science Foundation of Anhui Province (General Project) (2008085MF188).

Abstract: As a classical data mining technique, clustering is widely used in fields such as pattern recognition, machine learning, and artificial intelligence. Effective clustering analysis can identify the underlying structure of a dataset. As a commonly used partitional clustering algorithm, K-means is simple to implement and efficient at partitioning large-scale datasets. However, owing to its convergence rule, traditional K-means is sensitive to the initial cluster centers and cannot properly handle non-convex datasets or datasets with outliers. This paper proposes DC-Kmeans (density parameter and center replacement K-means), an improved K-means algorithm based on a density parameter and center replacement. By selecting the initial cluster centers gradually and continuously replacing imprecise old centers, DC-Kmeans is more accurate than traditional K-means. Two further methods are proposed for optimal clustering: 1) a novel clustering validity index (CVI), SCVI (sum of the inner-cluster compactness and the inter-cluster separateness based CVI), is proposed to evaluate the results of DC-Kmeans; 2) a new algorithm, OCNS (optimal clustering number determination based on SCVI), is designed to determine the optimal number of clusters for different datasets. Experimental results demonstrate that the proposed clustering method is effective for many kinds of datasets.

Key words: Cluster center, Clustering algorithm, Clustering validity index, Data mining, Optimal clustering number

CLC Number: 

  • TP181
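
The abstract describes a general workflow: run a K-means variant for several candidate cluster numbers, score each result with a validity index that weighs inner-cluster compactness against inter-cluster separateness, and keep the best-scoring number. The abstract does not give the formulas for DC-Kmeans, SCVI, or OCNS, so the following Python sketch only illustrates that workflow with stand-ins. The names `init_centers`, `toy_index`, and `best_k` are hypothetical, the seeding is plain farthest-point selection rather than the paper's density-based rule, and the index is a simple compactness-to-separation ratio rather than SCVI.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def init_centers(points, k):
    """Farthest-point seeding: an assumed stand-in, NOT the paper's density rule."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest already-chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[idx].append(p)
    return clusters


def kmeans(points, k, max_iter=100):
    """Standard Lloyd iterations; returns (centers, clusters)."""
    centers = init_centers(points, k)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def toy_index(centers, clusters):
    """Mean point-to-center distance over the minimum center-to-center distance.

    Lower is better. This is an illustrative ratio, not the paper's SCVI.
    """
    n = sum(len(c) for c in clusters)
    compact = sum(dist(p, centers[i]) for i, c in enumerate(clusters) for p in c) / n
    separation = min(dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])
    return compact / separation if separation > 0 else math.inf


def best_k(points, k_max):
    """Scan k = 2..k_max and return the k with the lowest toy index, plus all scores."""
    scores = {k: toy_index(*kmeans(points, k)) for k in range(2, k_max + 1)}
    return min(scores, key=scores.get), scores


# Three well-separated 2x2 blobs of points (made-up data for illustration).
blobs = [(0, 0), (10, 0), (0, 10)]
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]
k, scores = best_k(points, 5)
```

On this toy data the scan selects `k == 3`, matching the three blobs. Any index that rewards tight, well-separated clusters could replace `toy_index` without changing the scanning loop.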