计算机科学 ›› 2015, Vol. 42 ›› Issue (11): 235-239.doi: 10.11896/j.issn.1002-137X.2015.11.048

• 软件与数据库技术 • 上一篇    下一篇

云环境下基于数据流的k-means聚类算法

王飞,秦小麟,刘亮,沈尧   

  1. 南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家自然科学基金项目(61373015,61300052),国家教育部高等学校博士学科点专项科研基金资助

Algorithm for k-means Based on Data Stream in Cloud Computing

WANG Fei, QIN Xiao-lin, LIU Liang and SHEN Yao   

  • Online:2018-11-14 Published:2018-11-14

摘要: k-means算法是一种 最常用的基于划分的聚类算法。传统的集中式k-means算法已不能适应当前呈爆炸式增长的数据规模,设计分布式k-means算法成为了目前亟需解决的问题。现有分布式k-means算法基于MapReduce计算框架且没有考虑初始聚类中心的影响。由于每个MapReduce任务均需要读写分布式文件系统,导致MapReduce不能有效表达多个任务之间的依赖关系,因此提出了一种基于数据流的计算框架,该框架建立在MapReduce之上,将数据处理过程按照数据流图建模。在该框架的基础上,提出了一种高效的k-means算法,它采用基于多次采样的初始聚类中心选取方法来实现负载均衡及减少迭代次数。实验结果表明,该算法的可扩展性较好,且效率比现有算法高。

关键词: k-means,MapReduce,计算框架,数据流

Abstract: k-means algorithm is one of the most commonly used clustering algorithm.Now data scale is exploding,and traditional centralized algorithm can not meet the requirements,so it is an urgent problem to design distributed k-means clustering algorithm currently.Existing distributed k-means algorithms are based on MapReduce framework and don’t consider the clustering center.Since each MapReduce job reads and writes data from distributed file system,it is inefficient to express dependencies between jobs.Then this paper proposed a framework based on data stream.Based on MapReduce framework,this framework models according to the data flow diagram.And it proposed an efficient k-means algorithm on the framework.It uses an improved algorithm based on sampling to confirm clustering center for load ba-lance and reducing iterations.Experimental results demonstrate that the algorithm can efficiently resolve the large scale k-means cluster.

Key words: k-means,MapReduce,Framework,Data stream

[1] Han Jia-wei,Kamber M.Data mining concepts and techniques,second edition[M].Elsevier (Singapore) Pte Ltd,2006:251-263
[2] Kriegel H P,Krger P,Zimek A.Clustering high-dimensionaldata:A survey on subspace clustering,pattern-based clustering,and correlation clustering[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2009,3(1):1
[3] Forgy E.Cluster analysis of multivariate data:Efficiency vs.Interpretability of classifications [J].Biometrics,1965,1(3):768
[4] Malewicz G,Austern M H,Bik A J C,et al.Pregel:a system for large-scale graph processing [C]∥Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data.ACM,2010:135-146
[5] Wang J,Su X.An improved K-Means clustering algorithm[C]∥2011 IEEE 3rd International Conference on Communication Software and Networks (ICCSN).IEEE,2011:44-46
[6] Kumar N S,Rao K N,Govardhan A,et al.Undersampled K-means approach for handling imbalanced distributed data[J].Progress in Artificial Intelligence,2014,3(1):1-10
[7] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[J].Communications of the ACM,2008,51(1):107-113
[8] Apache.Hadoop[EB/OL].(2014-4-10)[2014-4-22].http://hadoop.apache.org/
[9] Ordonez C,Omiecinski E.Efficient disk-based K-means clustering for relational databases[J].IEEE Transactions on Know-ledge and Data Engineering,2004,16(8):909-921
[10] Pelleg D,Moore A W.X-means:Extending K-means with Efficient Estimation of the Number of Clusters[C]∥ICML.2000:727-734
[11] Huang J Z,Ng M K,Rong H,et al.Automated variable weighting in k-means type clustering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(5):657-668
[12] AlSabti K,Ranka S,Singh V.An efficient space-partitioning based algorithm for the K-means clustering[M]∥Methodologies for Knowledge Discovery and Data Mining.Springer Berlin Heidelberg,1999:355-360
[13] 王守强,朱大铭,韩爱丽.基于初始点选取的k-means聚类近似常数算法[J].计算机研究与发展,2007(z2):69-74 Wang Shou-qiang,Zhu Da-ming,Han Ai-li.A Constant Approxi-mate Alogritnm for K-Means Problem Based on Inital Point Selecting [J].Journal of Computer Research and Development,2007(z2):69-74
[14] 李飞,薛彬,黄亚楼.初始中心优化的 K-Means 聚类算法[J].计算机科学,2002,29(7):94-96 Li Fei,Xue Bin,Huang Ya-lou.K-Means Clustering Algorithm with Refined Inital Center[J].Computer Science,2002,29(7):94-96
[15] Dhillon I S,Modha D S.A data-clustering algorithm on distributed memory multiprocessors [M]∥Large-Scale Parallel Data Mining.Springer Berlin Heidelberg,2000:245-260
[16] Zhao W,Ma H,He Q.Parallel k-means clustering based on mapreduce[M]∥Cloud Computing.Springer Berlin Heidelberg,2009:674-679
[17] 江小平,李成华,向文,等.K-means聚类算法的 MapReduce 并行化实现 [J].华中科技大学学报(自然科学版),2011,39(1):120-124 Jiang Xiao-ping,Li Cheng-hua,Xiang Wen,et al.Parallel implementing k-means clustering algorithm using MapReduce programming mode [J].Journal of Huazhong University of Science and Technology (Natural Science Edition),2011,9(1):120-124
[18] Li Q,Wang P,Wang W,et al.An Efficient K-means Clustering Algorithm on MapReduce [C]∥Database Systems for Advanced Applications.Springer International Publishing,2014:357-371

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!