计算机科学 ›› 2022, Vol. 49 ›› Issue (1): 121-132.doi: 10.11896/jsjkx.201100148

• 数据库&大数据&数据科学 • 上一篇    下一篇

结合密度参数与中心替换的改进K-means算法及新聚类有效性指标研究

张亚迪, 孙悦, 刘锋, 朱二周   

  1. 安徽大学计算机科学与技术学院 合肥230601
  • 收稿日期:2020-11-23 修回日期:2021-04-19 出版日期:2022-01-15 发布日期:2022-01-18
  • 通讯作者: 朱二周(ezzhu@ahu.edu.cn)
  • 作者简介:zhangyd0831@gmail.com
  • 基金资助:
    安徽省自然科学基金(面上项目)(2008085MF188)

Study on Density Parameter and Center-Replacement Combined K-means and New Clustering Validity Index

ZHANG Ya-di, SUN Yue, LIU Feng, ZHU Er-zhou   

  1. School of Computer Science and Technology,Anhui University,Hefei 230601,China
  • Received:2020-11-23 Revised:2021-04-19 Online:2022-01-15 Published:2022-01-18
  • About author:ZHANG Ya-di,born in 1996,postgra-duate.Her main research interests include cluster analysis and machine learning.
    ZHU Er-zhou,born in 1981,Ph.D,associate professor,postgraduate supervisor.His main research interests include virtualization,program analysis,data mining,and information security.
  • Supported by:
    Natural Science Foundation of Anhui Province(General Project)(2008085MF188).

摘要: 聚类是一种经典的数据挖掘技术,它在模式识别、机器学习、人工智能等多个领域得到了广泛的应用。通过聚类分析,目标数据集的深层次结构可以被有效地发掘出来。作为一种常用的划分聚类算法,K-means具有实现简单、能够处理大型数据等优点。然而,受收敛规则的影响,K-means算法仍然存在着对初始类簇中心的选取非常敏感、不能很好地处理非凸型分布和有离群值的数据集等问题。文中提出了一种基于密度参数和中心替换的改进K-means算法DC-Kmeans。该算法采用数据对象的密度参数来逐步确定初始类簇中心,使用中心替换方法更新偏离实际位置的初始中心,因而比传统聚的类算法更加精确。为了获得最佳聚类效果,文中同时提出了一个能够对聚类结果进行有效评价的新聚类有效性指标SCVI和一个能够快速获得目标数据集最佳类簇数的新算法OCNS。实验结果表明,所提聚类方法对各种类型的数据集都是有效的。

关键词: 聚类算法, 聚类有效性指标, 类簇中心, 数据挖掘, 最佳类簇数

Abstract: As a classical data mining technique,clustering is widely used in fields as pattern recognition,machine learning,artificial intelligence,and so on.By effective clustering analysis,the underlying structures of datasets can be identified.As a commonly used partitional clustering algorithm,K-means is simple of implementation and efficient on classifying large scale datasets.However,due to the influence of the convergence rule,the traditional K-means is still suffering problems as sensitive to the initial clustering centers,cannot properly process non-convex distributed datasets and datasets with outliers.This paper proposes the DC-Kmeans (density parameter and center replacement K-means),an improved K-means algorithm based on the density parameter and center replacement.Due to the gradually selecting of initial clustering centers and continuously update imprecision old centers,the DC-Kmeans is more accurate than the traditional K-means.Two novel methods are also proposed for optimally clustering:1)a novel clustering validity index (CVI),SCVI (Sum of the inner-cluster compactness and the inter-cluster separateness based CVI),is proposed to evaluate the results of the DC-Kmeans;2)a new algorithm,OCNS (optimal clustering number determination based on SCVI),is designed to determine the optimal clustering numbers for different datasets.Experimental results demonstrate that the proposed clustering method is effective for many kinds of datasets.

Key words: Cluster center, Clustering algorithm, Clustering validity index, Data mining, Optimal clustering number

中图分类号: 

  • TP181
[1]XU R,WUNSCH D.Survey of clustering algorithm[J].IEEETransactions on Neural Networks,2005,16(3):645-678.
[2]LIANG B,LIANG J Y,CHAO S,et al.Fast global Kmeans clustering based on local geometrical information[J].Information Sciences,2013,245:168-180.
[3]REDMONDS J,HENEGHANC.A method for initialising theKmeans clustering algorithm usingkd-trees[J].Pattern Recognition Letters,2007,28(8):965-973.
[4]ZHOU S B,XUZ Y.A novel internal validity index based on the cluster centre and the nearest neighbour cluster[J].Applied Soft Computing,2018,71:78-88.
[5]ZHU E Z,MA R H.An effective partitional clustering algorithm based on new clustering validity index[J].Applied Soft Computing,2018,71:608-621.
[6]CALINSKI T,HARABASZ J.A dendrite method for clusteranalysis[J].Communications in Statistics,1974,3(1):1-27.
[7]ROUSSEEUW P J.Silhouettes:A graphical aid to the interpretation and validation of cluster analysis[J].Journal of Computational and Applied Mathematics,1987,22:53-65.
[8]GURRUTXAGA I,ALBISUA I,ARBELAITZ O,et al.SEP/COP:An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index[J].Pattern Recognition,2010,43:3364-3373.
[9]YUE S H,WANG J P,WANG J,et al.A new validity index for evaluating the clustering results by partitional clustering algorithm[J].Soft Computing,2016,20(3):1127-1138.
[10]CHEN X Y,SU Y L,CHEN Y,et al.GKmeans:an EfficientKmeans Clustering Algorithm Based on Grid[C]//Proceedings of the 1st International Symposium on Computer Network and Multimedia Technology (CNMT 2009).Wuhan,China,2009:18-20.
[11]ISLAM M Z,ESTIVILL-CASTRO V,RAHMAN M A,et al.Combining Kmeans and a genetic algorithm through a novel arrangement of genetic operators for high quality clustering[J].Expert Systems with Applications,2018,91:402-417.
[12]YODER J,PRIEBE C E.SEMI-SUPERVISED Kmeans++[J].Journal of Statistical Computation and Simulation,2017,87(13):2597-2608.
[13]HUSSAIN S F,HARIS M.A Kmeans based co-clustering(kCC) algorithm for sparse,high dimensional data[J].Expert Systems with Applications,2019,118:20-34.
[14]FADAEI A H,KHASTEH S H.Enhanced Kmeans re-cluste-ring over dynamic networks[J].Expert Systems with Applications,2019,132:126-140.
[15]ZHU E Z,ZHANG Y X,WEN P,et al.Fast and Stable Clustering Analysis based on Grid-mapping Kmeans Algorithm and New Clustering Validity Index[J].Neurocomputing,2019,363:149-170.
[16]DUNN J C.A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters[J].Journal of Cybernetics,1974,3:32-57.
[17]HUBERT L,SCHULTZ J.Quadratic assignment as a general data analysis strategy[J].British Journal of Mathematical and Statistical Psychology,1976,29(2):190-241.
[18]MAULIK U,BANDYOPADHYAY S.Performance evaluation of some clustering algorithms and validity indices[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2002,24(12):1650-1654.
[19]DAVIES D L,BOULDIN D W.A cluster separation measure[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,1979,1(2):224-227.
[20]BEZDEK J C.Numerical taxonomy with fuzzy sets[J].Journal of Mathematical Biology,1974,7(1):57-71.
[21]BEZDEK J C.Cluster validity with fuzzy sets[J].Journal of Cybernetics,1974,3(3):58-74.
[22]ZALIK K R.Cluster validity index for estimation of fuzzy clusters of different sizes and densities[J].Pattern Recognition,2010,43(10):3374-3390.
[23]KIM D W,LEE K H,LEE D.On cluster validity index for estimation of the optimal number of fuzzy clusters[J].Pattern Re-cognition,2004,37(10):2009-2025.
[24]CHEN M Y,LINKENS D A.Rule-base self-generation and simplification for data-driven fuzzy models[J].Fuzzy Sets and Systems,2004,142(1):243-265.
[25]TANG Y G,SUN F C,SUN Z Q.Improved validation index for fuzzy clustering[C]//Proceedigs of the 2005 American Control Conference (ACC 2005).2005:1120-1125.
[26]PAKHIRA M K,BANDYOPADHYAY S,MAULIK U.Validity index for crisp and fuzzy clusters[J].Pattern Recognition,2004,37(3):487-501.
[27]WU K L,YANG M S.A cluster validity index for fuzzy clustering[J].Pattern Recognition Letters,2005,26(9):1275-1291.
[28]STARCZEWSKI A.A new validity index for crisp clusters[J].Pattern Analysis and Applications,2017,20:687-700.
[29]ZHU E Z,ZHU B B,WEN P,et al.Effective Clustering Analysis based on New Designed CVI and Improved Clustering Algorithms[C]//Proceedings of the 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2018).2018:766-772.
[30]MAATEN L V D.t-SNE[OL].https://lvdmaaten.github.io/tsne.
[31]WANG Z Y,LIU J L.Kernel Subspace Clustering Based on Se-cond-order Neighbors[J].Computer Science,2021,48(6):86-95.
[32]PENG C C,CHEN Y L,XUN Y M.k-modes Clustering Gua-ranteeing Local Differential Privacy[J].Computer Science,2021,48(2):105-113.
[1] 柴慧敏, 张勇, 方敏.
基于特征相似度聚类的空中目标分群方法
Aerial Target Grouping Method Based on Feature Similarity Clustering
计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[2] 黎嵘繁, 钟婷, 吴劲, 周帆, 匡平.
基于时空注意力克里金的边坡形变数据插值方法
Spatio-Temporal Attention-based Kriging for Land Deformation Data Interpolation
计算机科学, 2022, 49(8): 33-39. https://doi.org/10.11896/jsjkx.210600161
[3] 么晓明, 丁世昌, 赵涛, 黄宏, 罗家德, 傅晓明.
大数据驱动的社会经济地位分析研究综述
Big Data-driven Based Socioeconomic Status Analysis:A Survey
计算机科学, 2022, 49(4): 80-87. https://doi.org/10.11896/jsjkx.211100014
[4] 孔钰婷, 谭富祥, 赵鑫, 张正航, 白璐, 钱育蓉.
基于差分隐私的K-means算法优化研究综述
Review of K-means Algorithm Optimization Based on Differential Privacy
计算机科学, 2022, 49(2): 162-173. https://doi.org/10.11896/jsjkx.201200008
[5] 马董, 李新源, 陈红梅, 肖清.
星型高影响的空间co-location模式挖掘
Mining Spatial co-location Patterns with Star High Influence
计算机科学, 2022, 49(1): 166-174. https://doi.org/10.11896/jsjkx.201000186
[6] 徐慧慧, 晏华.
基于相对危险度的儿童先心病风险因素分析算法
Relative Risk Degree Based Risk Factor Analysis Algorithm for Congenital Heart Disease in Children
计算机科学, 2021, 48(6): 210-214. https://doi.org/10.11896/jsjkx.200500082
[7] 李杉, 许新征.
基于双角度并行剪枝的VGG16优化方法
Parallel Pruning from Two Aspects for VGG16 Optimization
计算机科学, 2021, 48(6): 227-233. https://doi.org/10.11896/jsjkx.200800016
[8] 肖蕾, 陈荣赏, 缪淮扣, 洪煜.
融合聚类算法和缺陷预测的测试用例优先排序方法
Test Case Prioritization Combining Clustering Approach and Fault Prediction
计算机科学, 2021, 48(5): 99-108. https://doi.org/10.11896/jsjkx.200400100
[9] 张岩金, 白亮.
一种基于符号关系图的快速符号数据聚类算法
Fast Symbolic Data Clustering Algorithm Based on Symbolic Relation Graph
计算机科学, 2021, 48(4): 111-116. https://doi.org/10.11896/jsjkx.200800011
[10] 汤鑫瑶, 张正军, 储杰, 严涛.
基于自然最近邻的密度峰值聚类算法
Density Peaks Clustering Algorithm Based on Natural Nearest Neighbor
计算机科学, 2021, 48(3): 151-157. https://doi.org/10.11896/jsjkx.200100112
[11] 张寒烁, 杨冬菊.
基于关系图谱的科技数据分析算法
Technology Data Analysis Algorithm Based on Relational Graph
计算机科学, 2021, 48(3): 174-179. https://doi.org/10.11896/jsjkx.191200154
[12] 邹承明, 陈德.
高维大数据分析的无监督异常检测方法
Unsupervised Anomaly Detection Method for High-dimensional Big Data Analysis
计算机科学, 2021, 48(2): 121-127. https://doi.org/10.11896/jsjkx.191100141
[13] 王茂光, 杨行.
一种基于AP-Entropy选择集成的风控模型和算法
Risk Control Model and Algorithm Based on AP-Entropy Selection Ensemble
计算机科学, 2021, 48(11A): 71-76. https://doi.org/10.11896/jsjkx.210200110
[14] 刘新斌, 王丽珍, 周丽华.
MLCPM-UC:一种基于模式实例分布均匀系数的多级co-location模式挖掘算法
MLCPM-UC:A Multi-level Co-location Pattern Mining Algorithm Based on Uniform Coefficient of Pattern Instance Distribution
计算机科学, 2021, 48(11): 208-218. https://doi.org/10.11896/jsjkx.201000097
[15] 王卫东, 徐金慧, 张志峰, 杨习贝.
基于密度峰值聚类的高斯混合模型算法
Gaussian Mixture Models Algorithm Based on Density Peaks Clustering
计算机科学, 2021, 48(10): 191-196. https://doi.org/10.11896/jsjkx.200800191
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!