Computer Science, 2017, Vol. 44, Issue (Z6): 414-418. doi: 10.11896/j.issn.1002-137X.2017.6A.093


Performance Comparison of Clustering Algorithms in Spark

HAI Mo and ZHANG You   

  • Online: 2017-12-01 Published: 2018-12-01

Abstract: The performance of three typical clustering algorithms in Spark (K-means, Bisecting K-means and Gaussian Mixture) was compared experimentally in terms of runtime, speedup, scalability and sizeup. The results show that when the dataset is hundreds of megabytes in size, the runtime of all three algorithms decreases more markedly as the number of nodes increases. When the dataset is larger than 500 MB, the speedup of the three algorithms increases more markedly and grows linearly with the number of nodes. The scalability of the three algorithms decreases as the number of nodes increases. When the dataset is larger than 500 MB, the Bisecting K-means algorithm has the lowest scalability of the three. When the dataset is larger than 100 MB, the sizeup of the Gaussian Mixture algorithm is much larger than that of the K-means and Bisecting K-means algorithms.

Key words: Spark, K-means clustering, Bisecting K-means clustering, Gaussian mixture clustering, Runtime, Speedup, Scalability, Sizeup
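
The abstract does not spell out how the speedup, scalability and sizeup metrics are computed. The sketch below assumes the definitions commonly used in parallel clustering studies: speedup fixes the dataset and varies the node count, scalability (scaleup) grows the dataset in proportion to the node count, and sizeup fixes the node count and grows the dataset. The function names and the runtimes are hypothetical and are not taken from the paper.

```python
def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same dataset: runtime on 1 node divided by runtime on m nodes."""
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """Runtime for dataset D on 1 node divided by runtime for m*D on m nodes.

    A value of 1.0 would mean perfect scalability; values below 1.0 mean
    the cluster loses efficiency as data and nodes grow together.
    """
    return t_base / t_scaled


def sizeup(t_base: float, t_larger: float) -> float:
    """Same node count: runtime on the enlarged dataset divided by runtime on D."""
    return t_larger / t_base


# Hypothetical runtimes in seconds, keyed by node count (illustration only).
runtimes = {1: 120.0, 2: 66.0, 4: 40.0}
speedups = {m: speedup(runtimes[1], t) for m, t in runtimes.items()}

for m in sorted(speedups):
    print(f"{m} node(s): speedup = {speedups[m]:.2f}")
```

Under these assumed definitions, the abstract's "speedup increases linearly" corresponds to `speedup` growing in proportion to the node count, and "scalability decreases" corresponds to `scaleup` dropping below 1.0 as nodes are added.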

