Computer Science ›› 2017, Vol. 44 ›› Issue (Z11): 524-529.doi: 10.11896/j.issn.1002-137X.2017.11A.111

Previous Articles     Next Articles

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

HUANG Ming-ji and ZHANG Qian   

  • Online:2018-12-01 Published:2018-12-01

Abstract: With the cloud application requiring less running time and higher performance as well as memory prices continuing to decline,the technology of memory-based distributed computing framework,such as Spark,has received unprecedented attention and is widely applied.We designed and implemented parallelized DBSCAN algorithm based on Spark,which can minimize the shuffle frequency and the data amount in shuffle.In order to optimize the algorithm para-llelization strategy,we found the bottleneck of algorithm performance through the analysis and described the DAG of parallel DBSCAN algorithm based on Spark.Finally,we compared the performance of parallel DBSCAN algorithm with the DBSCAN algorithm,and analyzed the influence of different parameters on clustering results.Experimental results show that,compared with DBSCAN algorithm,the running time of parallel DBSCAN algorithm based on Spark decreases 37.2% and the speedup reaches 1.6 respectively on a data set containing three million lines without obvious loss of clustering accuracy.

Key words: Spark,Parallel DBSCAN algorithm,DAG,Parallelization strategy

[1] SUN J G,LIU J,ZHAO L Y.Clustering algorithms research[J].Journal of software,2008,19(1):48-61.
[2] JAIN A K,MURTY M N,FLYNN P J.Data clustering:a review[J].ACM computing surveys (CSUR),1999,31(3):264-323.
[3] WANG W,YANG J,MUNTZ R.STING:A statistical information grid approach to spatial data mining[C]∥VLDB.1997:186-195.
[4] ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases[C]∥ACM Sigmod Record.ACM,1996:103-114.
[5] ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]∥Kdd.1996:226-231.
[6] ZAHARIA M,CHOWDHURY M,FRANKLIN M J,et al.Spark:cluster computing with working sets[C]∥HotCloud,2010.Boston MA,US,2010:10.
[7] CUI X,ZHU P,YANG X,et al.Optimized big data K-means clustering using MapReduce[J].The Journal of Supercompu-ting,2014,70(3):1249-1259.
[8] 赖向阳,宫秀军,韩来明.一种MapReduce架构下基于遗传算法的K-Medoids聚类[J].计算机科学,2017,4(3):23-26,58.
[9] 乔少杰,郭俊,韩楠,等.大规模复杂网络社区并行发现算法[J].计算机学报,2017,0(3):687-700.
[10] GARG A,MANGLA A,GUPTA N,et al.PBIRCH:a scalable parallel clustering algorithm for incremental data[C]∥10th International Database Engineering and Applications Symposium (IDEAS’06).Delhi,India,2006:315-316.
[11] MLlib | Spark [EB/OL].http://spark.apache.org/mllib/.
[12] XU X,JAGER J,KRIEGEL H P.A fast parallel clustering algorithm for large spatial databases[M]∥High Performance Data Mining.Springer US,1999:263-290.
[13] PATWARY M A,PALSETIA D,AGRAWAL A,et al.A new scalable parallel DBSCAN algorithm using the disjoint-set data structure[C]∥Proceedings of the International Conference on High Performance Computing,Networking,Storage and Analysis.IEEE Computer Society Press,2012:62.
[14] SCICLUNA N,BOUGANIS C S.FPGA-based parallel DBS-CAN architecture[C]∥International Symposium on Applied Reconfigurable Computing.Springer International Publishing,2014:1-12.
[15] LOH W K,YU H.Fast density-based clustering through dataset partition using graphics processing units[J].Information Scien-ces,2015,308(C):94-112.
[16] HE Y,TAN H,LUO W,et al.Mr-dbscan:an efficient parallel density-based clustering algorithm using mapreduce[C]∥2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS).IEEE,2011:473-480.
[17] VEENMAN C J,REINDERS M J T,BACKER E.A maximum variance cluster algorithm[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2002,24(9):1273-1280.
[18] JAIN A K,LAW M H C.Data clustering:A user’s dilemma[C]∥International Conference on Pattern Recognition and Machine Intelligence.2005:1-10 .
[19] CHANG H,YEUNG D Y.ROBUST path-based spectral clustering[J].Pattern Recognition,2008,41(1):191-203.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!