计算机科学 ›› 2015, Vol. 42 ›› Issue (11): 235-239.doi: 10.11896/j.issn.1002-137X.2015.11.048
王飞,秦小麟,刘亮,沈尧
WANG Fei, QIN Xiao-lin, LIU Liang and SHEN Yao
摘要: k-means算法是一种 最常用的基于划分的聚类算法。传统的集中式k-means算法已不能适应当前呈爆炸式增长的数据规模,设计分布式k-means算法成为了目前亟需解决的问题。现有分布式k-means算法基于MapReduce计算框架且没有考虑初始聚类中心的影响。由于每个MapReduce任务均需要读写分布式文件系统,导致MapReduce不能有效表达多个任务之间的依赖关系,因此提出了一种基于数据流的计算框架,该框架建立在MapReduce之上,将数据处理过程按照数据流图建模。在该框架的基础上,提出了一种高效的k-means算法,它采用基于多次采样的初始聚类中心选取方法来实现负载均衡及减少迭代次数。实验结果表明,该算法的可扩展性较好,且效率比现有算法高。
[1] Han Jia-wei,Kamber M.Data mining concepts and techniques,second edition[M].Elsevier (Singapore) Pte Ltd,2006:251-263 [2] Kriegel H P,Krger P,Zimek A.Clustering high-dimensionaldata:A survey on subspace clustering,pattern-based clustering,and correlation clustering[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2009,3(1):1 [3] Forgy E.Cluster analysis of multivariate data:Efficiency vs.Interpretability of classifications [J].Biometrics,1965,1(3):768 [4] Malewicz G,Austern M H,Bik A J C,et al.Pregel:a system for large-scale graph processing [C]∥Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data.ACM,2010:135-146 [5] Wang J,Su X.An improved K-Means clustering algorithm[C]∥2011 IEEE 3rd International Conference on Communication Software and Networks (ICCSN).IEEE,2011:44-46 [6] Kumar N S,Rao K N,Govardhan A,et al.Undersampled K-means approach for handling imbalanced distributed data[J].Progress in Artificial Intelligence,2014,3(1):1-10 [7] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[J].Communications of the ACM,2008,51(1):107-113 [8] Apache.Hadoop[EB/OL].(2014-4-10)[2014-4-22].http://hadoop.apache.org/ [9] Ordonez C,Omiecinski E.Efficient disk-based K-means clustering for relational databases[J].IEEE Transactions on Know-ledge and Data Engineering,2004,16(8):909-921 [10] Pelleg D,Moore A W.X-means:Extending K-means with Efficient Estimation of the Number of Clusters[C]∥ICML.2000:727-734 [11] Huang J Z,Ng M K,Rong H,et al.Automated variable weighting in k-means type clustering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(5):657-668 [12] AlSabti K,Ranka S,Singh V.An efficient space-partitioning based algorithm for the K-means clustering[M]∥Methodologies for Knowledge Discovery and Data Mining.Springer Berlin Heidelberg,1999:355-360 [13] 王守强,朱大铭,韩爱丽.基于初始点选取的k-means聚类近似常数算法[J].计算机研究与发展,2007(z2):69-74 Wang Shou-qiang,Zhu Da-ming,Han Ai-li.A Constant Approxi-mate Alogritnm for K-Means Problem Based on Inital Point Selecting [J].Journal of Computer Research and Development,2007(z2):69-74 [14] 李飞,薛彬,黄亚楼.初始中心优化的 K-Means 聚类算法[J].计算机科学,2002,29(7):94-96 Li Fei,Xue Bin,Huang Ya-lou.K-Means Clustering Algorithm with Refined Inital Center[J].Computer Science,2002,29(7):94-96 [15] Dhillon I S,Modha D S.A data-clustering algorithm on distributed memory multiprocessors [M]∥Large-Scale Parallel Data Mining.Springer Berlin Heidelberg,2000:245-260 [16] Zhao W,Ma H,He Q.Parallel k-means clustering based on mapreduce[M]∥Cloud Computing.Springer Berlin Heidelberg,2009:674-679 [17] 江小平,李成华,向文,等.K-means聚类算法的 MapReduce 并行化实现 [J].华中科技大学学报(自然科学版),2011,39(1):120-124 Jiang Xiao-ping,Li Cheng-hua,Xiang Wen,et al.Parallel implementing k-means clustering algorithm using MapReduce programming mode [J].Journal of Huazhong University of Science and Technology (Natural Science Edition),2011,9(1):120-124 [18] Li Q,Wang P,Wang W,et al.An Efficient K-means Clustering Algorithm on MapReduce [C]∥Database Systems for Advanced Applications.Springer International Publishing,2014:357-371 |
No related articles found! |
|