一种基于Hadoop平台的新聚类算法

计算机科学 ›› 2014, Vol. 41 ›› Issue (4): 269-272.

一种基于Hadoop平台的新聚类算法

缪裕青,张锦杏,刘少兵,文益民,明媚

桂林电子科技大学计算机科学与工程学院桂林541004;桂林电子科技大学计算机科学与工程学院桂林541004;桂林电子科技大学计算机科学与工程学院桂林541004;桂林电子科技大学计算机科学与工程学院桂林541004;桂林电子科技大学计算机科学与工程学院桂林541004

出版日期:2018-11-14 发布日期:2018-11-14
基金资助:
本文受广西可信软件重点实验室研究课题(KX201116),广西教育厅科研项目(201204LX122)资助

New Clustering Algorithm Bases on Hadoop

MIAO Yu-qing,ZHANG Jin-xing,LIU Shao-bing,WEN Yi-min and MING Mei

Online:2018-11-14 Published:2018-11-14

摘要/Abstract

摘要： 针对现有很多聚类算法不能有效处理大规模数据的问题,基于微簇和等价连接关系,提出一种能在Hadoop 平台实现高效并行化的聚类算法bigKClustering。算法将紧凑的数据抽象成一个向量,然后通过等价关系对这些向量进行连接,得到最终的聚类结果。实验结果表明,bigKClustering算法不仅具有良好的时间效率和聚类效果,而且具有良好的可伸缩性、加速比和时间稳定性。

关键词: 微簇,等价连接,Hadoop平台,聚类

Abstract: Hadoop is a popular platform to handle huge datasets．But many clustering algorithms can not run effectively over it,for it lacks built-in support for iterative programs,which arises naturally in many clustering applications．We proposed bigClustering which can be easily parallelized in Hadoop MapReduce and done in quite a few MapReduce rounds．Our algorithm is based on the ideas of micro-cluster and equivalence relation．It divides a dataset into many groups and constructs one micro-cluster,which will be treated as a single point,corresponding to each group．All micro-clusters that are closed enough will be connected and put into the same group by the equivalence relation．The center of each group will be calculated and will be the center of a real cluster in the dataset．Experiments show that bigKClustering not only runs fast and obtains high clustering quality but also scales well.

Key words: Micro-cluster,Equivalence relation,Hadoop,Clustering

缪裕青,张锦杏,刘少兵,文益民,明媚. 一种基于Hadoop平台的新聚类算法[J]. 计算机科学, 2014, 41(4): 269-272. https://doi.org/

MIAO Yu-qing,ZHANG Jin-xing,LIU Shao-bing,WEN Yi-min and MING Mei. New Clustering Algorithm Bases on Hadoop[J]. Computer Science, 2014, 41(4): 269-272. https://doi.org/

参考文献

[1] Malewicz G,Austern M H,et al．Pregel:a system for large-scale graph processing[C]∥Proceedings of the 2010international conference on Management of data．Indiana,USA,2010:135-146
[2] Dean J,Ghemawat S．MapReduce:simplified data processing on large clusters[C]∥Proceedings of Operating Systems Design and Implementation．San Francisco,CA,2004:137-150
[3] 赵卫中,马慧芳,傅燕翔,等．基于云计算平台Hadoop的并k-means聚类算法设计研究[J]．计算机科学,2011,8(10):166-169
[4] Ene A,Im Sung-jin,et al．Fast clustering using MapReduce[C]∥Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining．California,USA,2011:681-689
[5] Bahman B,Benjamin M,et al．Scalable k-means++[J]．Proceedings of the VLDB Endowment,2012,5(7):622-633
[6] Aggarwal C C,Han Jia-wei,et al．A Framework for Clustering Evolving Data Streams [C]∥Proceedings of the International Conference on Very Large Data Bases．Berlin,Germany,2003:852-863
[7] Zhang Tian,Ramakrishnan R,et al．BIRCH:an efficient dataclustering method for very large databases[C]∥Proceedings of the 1996ACM SIGMOD international conference on Management of data．Montreal,Quebec,Canada,1996:103-114
[8] Ekanayake J,Li Hui,et al．Twister:a runtime for iterative Map-Reduce[C]∥Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing．Chicago,Illinois,2010:810-818
[9] http://archive.ics.uci.edu/ml/datasets/Cloud
[10] http://elki.dbs.ifi.lmu.de/wiki/DataSetGenerator
[11] Arthur D,Vassilvitskii S．k-means++:The advantages of careful seeding[C]∥Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms．New Orleans,Louisiana,2007:1027-1035

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed