计算机科学 ›› 2015, Vol. 42 ›› Issue (3): 201-205.doi: 10.11896/j.issn.1002-137X.2015.03.041

• 人工智能 • 上一篇    下一篇

一种改进K-means算法的聚类算法CARDBK

朱烨行,李艳玲,崔梦天,杨献文   

  1. 西安邮电大学经济与管理学院 西安710121,第二炮兵工程大学电子工程系 西安710025,西南民族大学计算机科学与技术学院 成都610041;电子科技大学计算机科学与工程学院 成都610000,西安财经学院信息与教育技术中心 西安710061
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家自然科学基金(61379019,9),中国博士后科学基金(2013M540704),四川省学术和技术带头人培养资金,四川省博士后科研基金资助

Clustering Algorithm CARDBK Improved from K-means Algorithm

ZHU Ye-hang, LI Yan-ling, CUI Meng-tian and YANG Xian-wen   

  • Online:2018-11-14 Published:2018-11-14

摘要: CARDBK聚类算法与批K-means算法的不同之处在于,每个点不是只归属于一个簇,而是同时影响多个簇的质心值,一个点影响某一个簇的质心值的程度取决于该点与其它离该点更近的簇的质心之间的距离值。 从聚类结果的熵、纯度、F1值、Rand Index和NMI等5个性能指标值来看,与多个不同算法在多个不同数据集上分别聚类相比, 该算法具有较好的聚类结果;与多个不同算法在同一数据集上很多不同的初始化条件下分别聚类相比,该算法具有较好且稳定的聚类结果;该算法在不同大小数据集上聚类时具有线性伸缩性且速度较快。

关键词: 聚类,文档聚类,文本聚类,K-means,算法

Abstract: The difference between our clustering algorithm and batch K-means algorithm is that in our algorithm each point is not only attributable to one cluster,instead affects multiple cluster centroid values,and the degree of influence of a point on a cluster centroid depends on the distance values between this point and the other more near cluster centroids.Our algorithm and a number of different algorithms on a number of different data sets were clustered respectively from the point of view of their clustering result’s five performance index values such as entropy,purity,F1 value,Rand Index and normalized mutual information,and the results show our algorithm has a better clustering results.Our algorithm and a number of different algorithms were clustered respectively on one same data set but under many different initialization conditions,and clustering results of our algorithm are preferably more stable and better.Cluster on different size data sets by our algorithm has a linear scalability and is faster.

Key words: Clustering,Text clustering,Document clustering,K-means,Algorithm

[1] 朱烨行.文档聚类算法研究[D].西安:西北工业大学,2009
[2] Zhao Ying,Karypis G.Criterion functions for document clustering:Experiments and analysis[R/OL].2003-04-23[2008-10-29].http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
[3] Anon.an Introduction to Cluster Analysis for Data Mining[EB/OL].2000-02-10[2008-12-2].http://www.dol88.com/p-567183494975.html
[4] 刘泉凤,陆蓓,王小华.文本挖掘中聚类算法的比较研究[J].计算机时代,2005(6):7-8,22
[5] 谷波,张永奎.文本聚类算法的分析与比较[J].电脑开发与应用,2003,16(11):4-6
[6] Bernd F.Some Competitive Learning Methods[R/OL].1997-04-05[2008-10-22].http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/JavaPaper/
[7] Ridella S,Rovetta S,Zunino R.Plastic Algorithm for AdaptiveVector Quantisation[J].Neural Computing & Applications,1998,7(1):37-51
[8] Pal N R,Bezdek J C,Tsao E C K.Generalized Clustering Networks and Kohonen’s Self-Organizing Scheme[J].IEEE Transaction on Neural Networks,1993,4(4):549-557
[9] Hansen P,Mladenovic N.J-Means:A New Local Search Heuristic for Minimum Sum-of-Squares Clustering[J].Pattern Recognition,2001,34(2):405-413
[10] 唐春生,张磊,潘东,等.文本分类研究进展[EB/OL].[2008-10-24].http://c.xml.org.cn/blog/uploadfile/20076211443809.PDF
[11] Karypis G.CLUTO- Software for Clustering High-Dimensional Datasets[CP/OL].2008[2008-10-25].http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
[12] Han E H,Boley D,Gini M,et al.WebACE:A webagent for do-cument categorization and exploration[C]∥Proceedings of the Second International Conference on Autonomous Agents.Minneapolis,Minnesota,United States:ACM,1998:408-415
[13] Beil F,Ester M,Xu Xiao-wei.Frequent term-based text clustering[C]∥Proceedings of the Eighth ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining.New York:ACM,2002:436-442
[14] Karypis G,Han Eui-hong.Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval[C]∥Proceedings of the Ninth International Confe-rence on Information and Knowledge Management.New York:ACM,2000:12-19
[15] Hersh W,Buckley C,Leone T,et al.OHSUMED:An Interactive Retrieval Evaluation and New Large Test Collection for Research[C]∥Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.Dublin,Ireland:ACM,1994:192-201
[16] Juan A,Vidal E.Comparison of Four Initialization Techniques for the K-Medians Clustering Algorithm[C]∥Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition.London,UK:Springer-Verlag,2000:842-852
[17] 梁冯珍,宋占杰,张玉环.应用概率统计[M].天津:天津大学出版社,2004
[18] Hanselman D,Littlefield B.精通MATLAB:综合辅导与掼[M].李人厚,张平安,译.西安:西安交通大学出版社,1998
[19] Velleman P F,Hoaglin D C.Applications,Basics,and Computing of Exploratory Data Analysis[M].Boston:Duxbury Press,c2004
[20] Macqueen J.Some methods of classification and analysis of multivariate observations[M]∥Le Cam L M,Neyman J,ed..Proc.of the fifth Berkeley Symposium on Mathematical Statistics and Probability.Los Angeles,USA:University of California Press,1967:281-297

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!