Comparison Research on Mahout Clustering Algorithms under Hadoop Platform

Computer Science, 2015, Vol. 42, Issue (Z6): 465-469.

NIU Yi-han and HAI Mo

School of Information, Central University of Finance and Economics, Beijing 100081 (both authors)

Online: 2018-11-14 Published: 2018-11-14
Funding:
This work was supported by the Beijing Higher Education Young Elite Teacher Project (YETP0988).

Abstract

Clustering is an important technique in data mining, used to partition a set of physical or abstract objects into multiple classes of similar objects. How to apply traditional clustering algorithms to large-scale data is a hot research issue in the current big data field. This paper compares the principles and MapReduce implementations of three clustering algorithms (Canopy, standard K-means, and Fuzzy K-means) in Mahout, the open-source machine learning library for the cloud computing platform Hadoop. The three algorithms are then run on clusters with different numbers of nodes and on data sets of different sizes, and compared in terms of speedup ratio, scalability, and scale growth. The experimental results show that, in a parallel environment: Canopy runs fastest, K-means second, and Fuzzy K-means slowest; all three algorithms achieve good speedup ratios, with Canopy the best, and the speedup ratio of Fuzzy K-means increases substantially once the data volume and the number of nodes reach a certain scale; all three algorithms have good scalability and scale growth, both of which improve as the data size increases, with Canopy showing the best scalability and Fuzzy K-means showing the largest improvement in scalability and scale growth.

Key words: Clustering, Hadoop, Mahout, K-means, Fuzzy K-means, Canopy
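
The speedup ratio named in the abstract can be illustrated with a minimal sketch. It assumes the common definition (runtime on one node divided by runtime on p nodes for the same data set); the abstract itself does not state the formula used in the paper. The function name `speedup` and the runtimes are hypothetical and are not results from the paper.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Assumed definition: runtime on 1 node / runtime on p nodes, same data set."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_single / t_parallel


# Hypothetical runtimes in seconds, keyed by node count (illustrative only).
runtimes = {1: 600.0, 2: 330.0, 4: 180.0}

# Speedup of each cluster size relative to the single-node run.
speedups = {nodes: speedup(runtimes[1], t) for nodes, t in runtimes.items()}

for nodes, s in sorted(speedups.items()):
    print(f"{nodes} node(s): speedup = {s:.2f}")
```

Under this definition, a value close to the node count indicates near-linear speedup, which is the sense in which one algorithm's speedup ratio can be called better than another's.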

NIU Yi-han and HAI Mo. Comparison Research on Mahout Clustering Algorithms under Hadoop Platform[J]. Computer Science, 2015, 42(Z6): 465-469. https://doi.org/

