Computer Science ›› 2017, Vol. 44 ›› Issue (Z6): 414-418. doi: 10.11896/j.issn.1002-137X.2017.6A.093

• Big Data and Data Mining •

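Performance Comparison of Clustering Algorithms in Spark

HAI Mo and ZHANG You

  1. School of Information, Central University of Finance and Economics, Beijing 100081; Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu 610054, Department of Information Systems Management, Heinz College, Carnegie Mellon University, Pittsburgh 999039
  • Online: 2017-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the Open Project of the Network and Data Security Key Laboratory of Sichuan Province (NDSMS201604) and the Young Teachers Development Fund of the Central University of Finance and Economics (QJJ1634).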

Abstract: Through experiments, the performance of three typical clustering algorithms in Spark (K-means, Bisecting K-means and Gaussian Mixture) is compared in terms of four metrics: runtime, speedup, scalability and sizeup. The results show that: 1) as the number of nodes increases, the runtime of all three algorithms decreases markedly when clustering datasets of over one hundred megabytes; 2) when the dataset is larger than 500MB, the speedup of all three algorithms improves markedly and grows approximately linearly with the number of nodes; 3) the scalability of all three algorithms decreases as the number of nodes increases, and when the dataset is larger than 500MB, the Bisecting K-means algorithm has the lowest scalability of the three; 4) when the dataset is larger than 100MB, the sizeup of the Gaussian Mixture algorithm is much higher than that of the K-means and Bisecting K-means algorithms.

Key words: Spark, K-means clustering, Bisecting K-means clustering, Gaussian mixture clustering, Runtime, Speedup, Scalability, Sizeup
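
As an illustration of the metrics named in the abstract, the sketch below computes speedup, scalability (often called scaleup) and sizeup from measured runtimes. It assumes the definitions commonly used in parallel data mining studies; this page does not state the paper's exact formulas, so the definitions, the function names and any runtime values passed in are assumptions for illustration only, not the authors' method or data.

```python
from typing import Dict


def _check_positive(*runtimes: float) -> None:
    """Reject non-positive runtimes, which would make the ratios meaningless."""
    for t in runtimes:
        if t <= 0:
            raise ValueError("runtimes must be positive")


def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Assumed definition: runtime on 1 node / runtime on m nodes, same dataset."""
    _check_positive(t_one_node, t_m_nodes)
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """Assumed definition: runtime for data D on 1 node /
    runtime for data m*D on m nodes. A value of 1.0 would be ideal."""
    _check_positive(t_base, t_scaled)
    return t_base / t_scaled


def sizeup(t_base_data: float, t_larger_data: float) -> float:
    """Assumed definition: runtime for data k*D / runtime for data D,
    with the number of nodes held fixed."""
    _check_positive(t_base_data, t_larger_data)
    return t_larger_data / t_base_data


def speedup_curve(runtimes_by_nodes: Dict[int, float]) -> Dict[int, float]:
    """Map node count -> speedup, using the 1-node runtime as the baseline."""
    t_one = runtimes_by_nodes[1]
    return {m: speedup(t_one, t) for m, t in sorted(runtimes_by_nodes.items())}
```

Under these assumed definitions, a speedup curve that grows roughly in proportion to the node count corresponds to the near-linear behaviour reported for datasets larger than 500MB, and a scaleup value that falls below 1.0 as nodes are added corresponds to the reported decrease in scalability.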

