Computer Science ›› 2015, Vol. 42 ›› Issue (Z6): 465-469.

Comparison Research on Mahout Clustering Algorithms under Hadoop Platform

NIU Yi-han and HAI Mo   

  • Online:2018-11-14 Published:2018-11-14

Abstract: Clustering is an important data mining technique that partitions a collection of physical or abstract objects into classes of similar objects. How to apply traditional clustering algorithms to large-scale data is a hot research topic in the field. This article analyzes and compares, in theory, the principles and MapReduce implementations of three clustering algorithms (Canopy, standard K-means and Fuzzy K-means) in Mahout, an open-source machine learning library that runs on the Hadoop cloud computing platform. The three algorithms are then run on clusters with different numbers of nodes and on data sets of different scales, and compared in terms of speedup ratio, scalability and scale growth. The experimental results show that, in a parallel environment, Canopy runs fastest, K-means second and Fuzzy K-means slowest. All three algorithms achieve good speedup ratios; Canopy has the best speedup, and the speedup of Fuzzy K-means increases substantially once the amount of data and the number of nodes reach a certain scale. All three also show good scalability and scale growth; Canopy has the best scalability, while Fuzzy K-means shows the largest increase in both scalability and scale growth.

Key words: Clustering, Hadoop, Mahout, K-means, Fuzzy K-means, Canopy
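
The three comparison criteria named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas, so the definitions below are assumptions based on the conventional ones for parallel algorithms: speedup compares one node against m nodes on the same data, scalability (scaleup) grows nodes and data together, and scale growth (sizeup) grows only the data. The function names and the runtimes in the usage lines are hypothetical placeholders, not measurements from the article.

```python
def _check_positive(*runtimes: float) -> None:
    """Reject non-positive runtimes, which would make the ratios meaningless."""
    if any(t <= 0 for t in runtimes):
        raise ValueError("runtimes must be positive")


def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Assumed definition: runtime on 1 node / runtime on m nodes, same data set."""
    _check_positive(t_one_node, t_m_nodes)
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """Assumed definition: runtime of (1 node, data D) / runtime of (m nodes, data m*D).

    A value close to 1.0 means the system absorbs proportional growth well.
    """
    _check_positive(t_base, t_scaled)
    return t_base / t_scaled


def sizeup(t_base_data: float, t_grown_data: float) -> float:
    """Assumed definition: runtime on data m*D / runtime on data D, node count fixed."""
    _check_positive(t_base_data, t_grown_data)
    return t_grown_data / t_base_data


# Placeholder runtimes in seconds, chosen only to show the arithmetic.
print(speedup(120.0, 40.0))   # 1 node vs. m nodes on the same data
print(scaleup(100.0, 125.0))  # (1 node, D) vs. (m nodes, m*D)
print(sizeup(50.0, 200.0))    # D vs. m*D on a fixed cluster
```

Under these assumed definitions, a higher speedup and a scaleup closer to 1.0 indicate better parallel behavior, while a sizeup that stays below the data growth factor indicates that runtime grows more slowly than the data.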
