Computer Science ›› 2014, Vol. 41 ›› Issue (4): 269-272.

Previous Articles     Next Articles

New Clustering Algorithm Bases on Hadoop

MIAO Yu-qing,ZHANG Jin-xing,LIU Shao-bing,WEN Yi-min and MING Mei   

  • Online:2018-11-14 Published:2018-11-14

Abstract: Hadoop is a popular platform to handle huge datasets.But many clustering algorithms can not run effectively over it,for it lacks built-in support for iterative programs,which arises naturally in many clustering applications.We proposed bigClustering which can be easily parallelized in Hadoop MapReduce and done in quite a few MapReduce rounds.Our algorithm is based on the ideas of micro-cluster and equivalence relation.It divides a dataset into many groups and constructs one micro-cluster,which will be treated as a single point,corresponding to each group.All micro-clusters that are closed enough will be connected and put into the same group by the equivalence relation.The center of each group will be calculated and will be the center of a real cluster in the dataset.Experiments show that bigKClustering not only runs fast and obtains high clustering quality but also scales well.

Key words: Micro-cluster,Equivalence relation,Hadoop,Clustering

[1] Malewicz G,Austern M H,et al.Pregel:a system for large-scale graph processing[C]∥Proceedings of the 2010international conference on Management of data.Indiana,USA,2010:135-146
[2] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[C]∥Proceedings of Operating Systems Design and Implementation.San Francisco,CA,2004:137-150
[3] 赵卫中,马慧芳,傅燕翔,等.基于云计算平台Hadoop的并k-means聚类算法设计研究[J].计算机科学,2011,8(10):166-169
[4] Ene A,Im Sung-jin,et al.Fast clustering using MapReduce[C]∥Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining.California,USA,2011:681-689
[5] Bahman B,Benjamin M,et al.Scalable k-means++[J].Proceedings of the VLDB Endowment,2012,5(7):622-633
[6] Aggarwal C C,Han Jia-wei,et al.A Framework for Clustering Evolving Data Streams [C]∥Proceedings of the International Conference on Very Large Data Bases.Berlin,Germany,2003:852-863
[7] Zhang Tian,Ramakrishnan R,et al.BIRCH:an efficient dataclustering method for very large databases[C]∥Proceedings of the 1996ACM SIGMOD international conference on Management of data.Montreal,Quebec,Canada,1996:103-114
[8] Ekanayake J,Li Hui,et al.Twister:a runtime for iterative Map-Reduce[C]∥Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing.Chicago,Illinois,2010:810-818
[9] http://archive.ics.uci.edu/ml/datasets/Cloud
[10] http://elki.dbs.ifi.lmu.de/wiki/DataSetGenerator
[11] Arthur D,Vassilvitskii S.k-means++:The advantages of careful seeding[C]∥Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms.New Orleans,Louisiana,2007:1027-1035

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!