Computer Science ›› 2015, Vol. 42 ›› Issue (11): 235-239.doi: 10.11896/j.issn.1002-137X.2015.11.048

Previous Articles     Next Articles

Algorithm for k-means Based on Data Stream in Cloud Computing

WANG Fei, QIN Xiao-lin, LIU Liang and SHEN Yao   

  • Online:2018-11-14 Published:2018-11-14

Abstract: k-means algorithm is one of the most commonly used clustering algorithm.Now data scale is exploding,and traditional centralized algorithm can not meet the requirements,so it is an urgent problem to design distributed k-means clustering algorithm currently.Existing distributed k-means algorithms are based on MapReduce framework and don’t consider the clustering center.Since each MapReduce job reads and writes data from distributed file system,it is inefficient to express dependencies between jobs.Then this paper proposed a framework based on data stream.Based on MapReduce framework,this framework models according to the data flow diagram.And it proposed an efficient k-means algorithm on the framework.It uses an improved algorithm based on sampling to confirm clustering center for load ba-lance and reducing iterations.Experimental results demonstrate that the algorithm can efficiently resolve the large scale k-means cluster.

Key words: k-means,MapReduce,Framework,Data stream

[1] Han Jia-wei,Kamber M.Data mining concepts and techniques,second edition[M].Elsevier (Singapore) Pte Ltd,2006:251-263
[2] Kriegel H P,Krger P,Zimek A.Clustering high-dimensionaldata:A survey on subspace clustering,pattern-based clustering,and correlation clustering[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2009,3(1):1
[3] Forgy E.Cluster analysis of multivariate data:Efficiency vs.Interpretability of classifications [J].Biometrics,1965,1(3):768
[4] Malewicz G,Austern M H,Bik A J C,et al.Pregel:a system for large-scale graph processing [C]∥Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data.ACM,2010:135-146
[5] Wang J,Su X.An improved K-Means clustering algorithm[C]∥2011 IEEE 3rd International Conference on Communication Software and Networks (ICCSN).IEEE,2011:44-46
[6] Kumar N S,Rao K N,Govardhan A,et al.Undersampled K-means approach for handling imbalanced distributed data[J].Progress in Artificial Intelligence,2014,3(1):1-10
[7] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[J].Communications of the ACM,2008,51(1):107-113
[8] Apache.Hadoop[EB/OL].(2014-4-10)[2014-4-22].http://hadoop.apache.org/
[9] Ordonez C,Omiecinski E.Efficient disk-based K-means clustering for relational databases[J].IEEE Transactions on Know-ledge and Data Engineering,2004,16(8):909-921
[10] Pelleg D,Moore A W.X-means:Extending K-means with Efficient Estimation of the Number of Clusters[C]∥ICML.2000:727-734
[11] Huang J Z,Ng M K,Rong H,et al.Automated variable weighting in k-means type clustering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(5):657-668
[12] AlSabti K,Ranka S,Singh V.An efficient space-partitioning based algorithm for the K-means clustering[M]∥Methodologies for Knowledge Discovery and Data Mining.Springer Berlin Heidelberg,1999:355-360
[13] 王守强,朱大铭,韩爱丽.基于初始点选取的k-means聚类近似常数算法[J].计算机研究与发展,2007(z2):69-74 Wang Shou-qiang,Zhu Da-ming,Han Ai-li.A Constant Approxi-mate Alogritnm for K-Means Problem Based on Inital Point Selecting [J].Journal of Computer Research and Development,2007(z2):69-74
[14] 李飞,薛彬,黄亚楼.初始中心优化的 K-Means 聚类算法[J].计算机科学,2002,29(7):94-96 Li Fei,Xue Bin,Huang Ya-lou.K-Means Clustering Algorithm with Refined Inital Center[J].Computer Science,2002,29(7):94-96
[15] Dhillon I S,Modha D S.A data-clustering algorithm on distributed memory multiprocessors [M]∥Large-Scale Parallel Data Mining.Springer Berlin Heidelberg,2000:245-260
[16] Zhao W,Ma H,He Q.Parallel k-means clustering based on mapreduce[M]∥Cloud Computing.Springer Berlin Heidelberg,2009:674-679
[17] 江小平,李成华,向文,等.K-means聚类算法的 MapReduce 并行化实现 [J].华中科技大学学报(自然科学版),2011,39(1):120-124 Jiang Xiao-ping,Li Cheng-hua,Xiang Wen,et al.Parallel implementing k-means clustering algorithm using MapReduce programming mode [J].Journal of Huazhong University of Science and Technology (Natural Science Edition),2011,9(1):120-124
[18] Li Q,Wang P,Wang W,et al.An Efficient K-means Clustering Algorithm on MapReduce [C]∥Database Systems for Advanced Applications.Springer International Publishing,2014:357-371

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!