Computer Science ›› 2020, Vol. 47 ›› Issue (11A): 425-429.doi: 10.11896/jsjkx.190700071

• Big Data & Data Science • Previous Articles     Next Articles

Application of Improved DBSCAN Algorithm on Spark Platform

DENG Ding-sheng   

  1. School of Science and Technology,Sichuan Minzu College,Kangding,Sichuan 626001,China
  • Online:2020-11-15 Published:2020-11-17
  • About author:DENG Ding-sheng,born in 1978,asso-ciate professor.His main research interests include algorithm analysis and design and so on.
  • Supported by:
    This work was supported by the Key Project of Natural Science of SichuanMinzu College(XYZB19001ZA),Key Project of Natural Science of Sichuan Provincial Education Department (17ZA0295),2017 Applied Demonstration Course Project of Sichuan Minzu College (sfkc201705) and National Natural Science Foundation of China(11461058).

Abstract: Aiming at the problem of high memory occupancy of DBSCAN(Density-Based Spatial Clustering of Applications with Noise) clustering algorithm,this paper combines the improved DBSCAN clustering algorithm with the parallel clustering calculation theory of Spark platform,and the clustering and processing methods for massive data are clustered,which greatly reduces the memory usage of the algorithm.The experimental simulation results show that the proposed parallel computing method can effectively reduce the shortage of memory,and it also can be used to evaluate the clustering effect of the DBSCAN clustering algorithm on the Hadoop platform,and compare and analyze the twoclustering methods to obtain better computing performance.Besides,the acceleration is increased by about 24% compared with that on the Hadoop platform.The proposed method can be used to evaluate the pros and cons of the DBSCAN clustering algorithm in clustering.

Key words: Parallel computing, DBSCAN, Clustering algorithm, Spark, Clustering acceleration ratio

CLC Number: 

  • TP391
[1] AFSAR M,TAYARANI-N M H,AZIZ M.An adaptive competition-based clustering approach for wireless sensor networks[J].Telecommunication Systems,2016,61(1):181-204.
[2] ZHAO W,XIA G S,GOU Z J,et al.An Improved DBSCAN Algorithm[J].Journal of Sichuan Normal University (Natural Science Edition),2013(2):114-116.
[3] WANG L,WU L L,FU D M.A Density-Dased Fuzzy Adaptive Clustering Algorithm[J].Journal of University of Science and Technology Beijing,2014(11):312-316.
[4] ZHANG Q,WANG X,WANG X.An OPTICS Clustering-Based Anomalous Data Filtering Algorithm for Condition Monitoring of Power Equipment[C]//Revised Selected Papers of the Third Ecml Pkdd Workshop on Data Analytics for Renewable Energy Integration.Springer-Verlag New York,Inc.,2015:123-134.
[5] HUANG H,YOO S,YU D,et al.Density-Aware ClusteringBased on Aggregated Heat Kernel and Its Transformation[J].Acm Transactions on Knowledge Discovery from Data,2015,9(4):29-32.
[6] YE X,SAKURAI T.Spectral clustering using robust similarity measure based on closeness of shared Nearest Neighbors[C]//International Joint Conference on Neural Networks.IEEE,2015:1-8.
[7] SINGH G,KAUR J,MULGE Y.Performance evaluation of enhanced hierarchical and partitioning based clustering algorithm (EPBCA) in data mining[C]//International Conference on Applied and Theoretical Computing and Communication Technology.IEEE,2015:805-810.
[8] WANG B,ZHANG C,SONG L,et al.Design and optimization of DBSCAN Algorithm based on CUDA[J].Computer Science,2015,40(5):553-556.
[9] LENG Y,CHEN Z,ZHONG F,et al.BRDPHHC:A Balance RDF Data Partitioning Algorithm Based on Hybrid Hierarchical Clustering[C]//IEEE,International Conference on High PERFORMANCE Computing and Communications.IEEE,2015:1755-1760.
[10] LIN W H,TAN X J,LIU F J,et al.A new directional query method for polygon dataset in spatial database[J].Earth Science Informatics,2015,8(4):775-786.
[11] EZUGWU A E,FRINCU M E,JUNAIDU S B.A Multiagent-Based Approach to Scheduling of Multi-component Applications in Distributed Systems[J].Advances in Intelligent Systems and Computing,2015,347:1-12.
[12] ZHANG J,YOU S,GRUENWALD L.Large-scale spatial data processing on GPUs and GPU-accelerated clusters[J].Sigspatial Special,2015,6(3):27-34.
[13] YANG J.From Google File System to Omega:A Decade of Advancement in Big Data Management at Google[C]//IEEE First International Conference on Big Data Computing Service and Applications.IEEE,2015:249-255.
[14] WANG Y,ZHANG L,TAN J,et al.HydraDB:a resilient RDMA-driven key-value middleware for in-memory cluster computing[C]//2015 SC-International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.IEEE,2017:22.
[15] WANG X,WU Y,JIANG X H,et al.Incremental Parallel Fast Clustering Based on DBSCAN Algorithm Under Large-scale Data Sets[J].Computer Applications and Software,2018,35(4):269-280.
[16] APILETTI D,GARZA P,PULVIRENTI F.A Review of Scalable Approaches for Frequent Itemset Mining[C]//East European Conference on Advances in Databases and Information Systems.Springer International Publishing,2015:243-247.
[17] WANG B,ZHANG C,SONG L,et al.Design and optimization of DBSCAN Algorithm based on CUDA[J].Computer Science,2015,40(5):553-556.
[18] CHEN R,ZHANG Y,ZHANG J,et al.Design and Optimizations of the MD5 Crypt Cracking Algorithm Based on CUDA[C]//International Conference on Cloud Computing.Springer International Publishing,2014:155-164.
[19] NING J F.DBSCAN Text Clustering Algorithm Based on Spark Framework[J].Journal of Shantou University (Natural Science Edition),2018,33(2):73-80.
[20] PENG X Y,YANG Y B,WANG C D,et al.An Efficient Parallel Nonlinear Clustering Algorithm Using MapReduce[C]//2016 IEEE International Parallel and Distributed Processing Sympo-sium Workshops.IEEE,2016:1473-1476.
[1] ZHANG Yu, LU Yi-hong, HUANG De-cai. Weighted Hesitant Fuzzy Clustering Based on Density Peaks [J]. Computer Science, 2021, 48(1): 145-151.
[2] MA Meng-yu, WU Ye, CHEN Luo, WU Jiang-jiang, LI Jun, JING Ning. Display-oriented Data Visualization Technique for Large-scale Geographic Vector Data [J]. Computer Science, 2020, 47(9): 117-122.
[3] CHEN Guo-liang, ZHANG Yu-jie, . Development of Parallel Computing Subject [J]. Computer Science, 2020, 47(8): 1-4.
[4] YANG Wang-dong, WANG Hao-tian, ZHANG Yu-feng, LIN Sheng-le, CAI Qin-yun. Survey of Heterogeneous Hybrid Parallel Computing [J]. Computer Science, 2020, 47(8): 5-16.
[5] XU Shou-kun, NI Chu-han, JI Chen-chen, LI Ning. Image Caption of Safety Helmets Wearing in Construction Scene Based on YOLOv3 [J]. Computer Science, 2020, 47(8): 233-240.
[6] LUO Jin-nan and ZHANG Ji-min. Rail Area Extraction Using Extended Haar-like Features and DBSCAN Clustering [J]. Computer Science, 2020, 47(6A): 153-156.
[7] YANG Zong-lin, LI Tian-rui, LIU Sheng-jiu, YIN Cheng-feng, JIA Zhen, ZHU Jie. Streaming Parallel Text Proofreading Based on Spark Streaming [J]. Computer Science, 2020, 47(4): 36-41.
[8] ZHU An-qing, LI Shuai, TANG Xiao-dong. Parallel FP_growth Association Rules Mining Method on Spark Platform [J]. Computer Science, 2020, 47(12): 139-143.
[9] YU Xin-yi, SHI Tian-feng, TANG Quan-rui, YIN Hui-wu, OU Lin-lin. Industrial Equipment Management System for Predictive Maintenance [J]. Computer Science, 2020, 47(11A): 667-672.
[10] XU Chuan-fu,WANG Xi,LIU Shu,CHEN Shi-zhao,LIN Yu. Large-scale High-performance Lattice Boltzmann Multi-phase Flow Simulations Based on Python [J]. Computer Science, 2020, 47(1): 17-23.
[11] XU Lei, CHEN Rong-liang, CAI Xiao-chuan. Scalable Parallel Finite Volume Lattice Boltzmann Method Based on Unstructured Grid [J]. Computer Science, 2019, 46(8): 84-88.
[12] JIA Ning, LI Ying-da. Construction of Personalized Health Monitoring Platform Based on Intelligent Wearable Device [J]. Computer Science, 2019, 46(6A): 566-570.
[13] ZHANG Jian-xin, LIU Hong, LI Yan. Efficient Grouping Method for Crowd Evacuation [J]. Computer Science, 2019, 46(6): 231-238.
[14] ZHAO Jun-xian, YU Jian. Optimization of Spark RDD Based on Non-serialization Native Storage [J]. Computer Science, 2019, 46(5): 143-149.
[15] WEI Liang, LIN Zi-yu, LAI Yong-xuan. DFTS:A Top-k Skyline Query for Large Datasets [J]. Computer Science, 2019, 46(5): 150-156.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] LEI Li-hui and WANG Jing. Parallelization of LTL Model Checking Based on Possibility Measure[J]. Computer Science, 2018, 45(4): 71 -75 .
[2] ZHANG Jia-nan and XIAO Ming-yu. Approximation Algorithm for Weighted Mixed Domination Problem[J]. Computer Science, 2018, 45(4): 83 -88 .
[3] ZHOU Yan-ping and YE Qiao-lin. L1-norm Distance Based Least Squares Twin Support Vector Machine[J]. Computer Science, 2018, 45(4): 100 -105 .
[4] WANG Zhen-chao, HOU Huan-huan and LIAN Rui. Path Optimization Scheme for Restraining Degree of Disorder in CMT[J]. Computer Science, 2018, 45(4): 122 -125 .
[5] YANG Yu-qi, ZHANG Guo-an and JIN Xi-long. Dual-cluster-head Routing Protocol Based on Vehicle Density in VANETs[J]. Computer Science, 2018, 45(4): 126 -130 .
[6] SHI Chao, XIE Zai-peng, LIU Han and LV Xin. Optimization of Container Deployment Strategy Based on Stable Matching[J]. Computer Science, 2018, 45(4): 131 -136 .
[7] PANG Bo, JIN Qian-kun, HENIGULI·Wu Mai Er and QI Xing-bin. Routing Scheme Based on Network Slicing and ILP Model in SDN[J]. Computer Science, 2018, 45(4): 143 -147 .
[8] GUO Shuai, LIU Liang and QIN Xiao-lin. Spatial Keyword Range Query with User Preferences Constraint[J]. Computer Science, 2018, 45(4): 182 -189 .
[9] ZHAN Yun-jiao, WEI Ou and HU Jun. Formal Description of Requirement of Slats and Flaps Control System for DO-178C Case[J]. Computer Science, 2018, 45(4): 196 -202 .
[10] WEN Jun-hao, SUN Guang-hui and LI Shun. Study on Matrix Factorization Recommendation Algorithm Based on User Clustering and Mobile Context[J]. Computer Science, 2018, 45(4): 215 -219 .