计算机科学 ›› 2017, Vol. 44 ›› Issue (Z11): 524-529.doi: 10.11896/j.issn.1002-137X.2017.11A.111

• 综合、交叉与应用 • 上一篇    下一篇

基于Spark的并行DBSCAN算法的设计与实现

黄明吉,张倩   

  1. 北京科技大学机械工程学院 北京100083,北京科技大学机械工程学院 北京100083
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受北京市自然科学基金(2112011),中央高校基本科研业务费基金(2050205)资助

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

HUANG Ming-ji and ZHANG Qian   

  • Online:2018-12-01 Published:2018-12-01

摘要: 随着云应用对运行时间和性能水平要求的逐步提高,以及内存价格的持续走低,基于内存的分布式计算框架Spark获得了前所未有的关注。主要研究DBSCAN算法在Spark上并行化的设计与实现,通过整体分析找到算法并行化可能的性能瓶颈,并从Spark的角度设计了并行DBSCAN算法的DAG图,优化了算法的并行化策略,最大化地降低了shuffle频率和数据量。最后将并行DBSCAN算法与单机DBSCAN算法进行性能对比,并通过实验分析不同参数对聚类结果的影响。结果表明,与单机DBSCAN算法相比,基于Spark的并行DBSCAN算法在聚类精度没有明显损失的情况下,数据量在3百万行时运行效率提高了37.2%,且加速比达到1.6。

关键词: Spark,并行DBSCAN算法,DAG,并行化策略

Abstract: With the cloud application requiring less running time and higher performance as well as memory prices continuing to decline,the technology of memory-based distributed computing framework,such as Spark,has received unprecedented attention and is widely applied.We designed and implemented parallelized DBSCAN algorithm based on Spark,which can minimize the shuffle frequency and the data amount in shuffle.In order to optimize the algorithm para-llelization strategy,we found the bottleneck of algorithm performance through the analysis and described the DAG of parallel DBSCAN algorithm based on Spark.Finally,we compared the performance of parallel DBSCAN algorithm with the DBSCAN algorithm,and analyzed the influence of different parameters on clustering results.Experimental results show that,compared with DBSCAN algorithm,the running time of parallel DBSCAN algorithm based on Spark decreases 37.2% and the speedup reaches 1.6 respectively on a data set containing three million lines without obvious loss of clustering accuracy.

Key words: Spark,Parallel DBSCAN algorithm,DAG,Parallelization strategy

[1] SUN J G,LIU J,ZHAO L Y.Clustering algorithms research[J].Journal of software,2008,19(1):48-61.
[2] JAIN A K,MURTY M N,FLYNN P J.Data clustering:a review[J].ACM computing surveys (CSUR),1999,31(3):264-323.
[3] WANG W,YANG J,MUNTZ R.STING:A statistical information grid approach to spatial data mining[C]∥VLDB.1997:186-195.
[4] ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases[C]∥ACM Sigmod Record.ACM,1996:103-114.
[5] ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]∥Kdd.1996:226-231.
[6] ZAHARIA M,CHOWDHURY M,FRANKLIN M J,et al.Spark:cluster computing with working sets[C]∥HotCloud,2010.Boston MA,US,2010:10.
[7] CUI X,ZHU P,YANG X,et al.Optimized big data K-means clustering using MapReduce[J].The Journal of Supercompu-ting,2014,70(3):1249-1259.
[8] 赖向阳,宫秀军,韩来明.一种MapReduce架构下基于遗传算法的K-Medoids聚类[J].计算机科学,2017,4(3):23-26,58.
[9] 乔少杰,郭俊,韩楠,等.大规模复杂网络社区并行发现算法[J].计算机学报,2017,0(3):687-700.
[10] GARG A,MANGLA A,GUPTA N,et al.PBIRCH:a scalable parallel clustering algorithm for incremental data[C]∥10th International Database Engineering and Applications Symposium (IDEAS’06).Delhi,India,2006:315-316.
[11] MLlib | Spark [EB/OL].http://spark.apache.org/mllib/.
[12] XU X,JAGER J,KRIEGEL H P.A fast parallel clustering algorithm for large spatial databases[M]∥High Performance Data Mining.Springer US,1999:263-290.
[13] PATWARY M A,PALSETIA D,AGRAWAL A,et al.A new scalable parallel DBSCAN algorithm using the disjoint-set data structure[C]∥Proceedings of the International Conference on High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society Press,2012:62.
[14] SCICLUNA N,BOUGANIS C S.FPGA-based parallel DBS-CAN architecture[C]∥International Symposium on Applied Reconfigurable Computing.Springer International Publishing,2014:1-12.
[15] LOH W K,YU H.Fast density-based clustering through dataset partition using graphics processing units[J].Information Scien-ces,2015,308(C):94-112.
[16] HE Y,TAN H,LUO W,et al.Mr-dbscan:an efficient parallel density-based clustering algorithm using mapreduce[C]∥2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS).IEEE,2011:473-480.
[17] VEENMAN C J,REINDERS M J T,BACKER E.A maximum variance cluster algorithm[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2002,24(9):1273-1280.
[18] JAIN A K,LAW M H C.Data clustering:A user’s dilemma[C]∥International Conference on Pattern Recognition and Machine Intelligence.2005:1-10 .
[19] CHANG H,YEUNG D Y.ROBUST path-based spectral clustering[J].Pattern Recognition,2008,41(1):191-203.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!