计算机科学 ›› 2020, Vol. 47 ›› Issue (11A): 425-429.doi: 10.11896/jsjkx.190700071

• 大数据&数据科学 • 上一篇    下一篇

一种改进的DBSCAN算法在Spark平台上的应用

邓定胜   

  1. 四川民族学院理工学院 四川 康定 626001
  • 出版日期:2020-11-15 发布日期:2020-11-17
  • 通讯作者: 邓定胜(307729837@qq.com)
  • 基金资助:
    四川民族学院自然科学重点项目(XYZB19001ZA);四川省教育厅自然科学重点项目(17ZA0295);四川民族学院2017年应用型示范课程项目(sfkc201705);国家自然科学基金项目(11461058)

Application of Improved DBSCAN Algorithm on Spark Platform

DENG Ding-sheng   

  1. School of Science and Technology,Sichuan Minzu College,Kangding,Sichuan 626001,China
  • Online:2020-11-15 Published:2020-11-17
  • About author:DENG Ding-sheng,born in 1978,asso-ciate professor.His main research interests include algorithm analysis and design and so on.
  • Supported by:
    This work was supported by the Key Project of Natural Science of SichuanMinzu College(XYZB19001ZA),Key Project of Natural Science of Sichuan Provincial Education Department (17ZA0295),2017 Applied Demonstration Course Project of Sichuan Minzu College (sfkc201705) and National Natural Science Foundation of China(11461058).

摘要: 针对DBSCAN(Density-Based Spatial Clustering of Applications with Noise)聚类算法内存占用率较高的问题,文中将改进的DBSCAN聚类算法与Spark平台并行聚类计算理论相结合,对海量数据采用分而治之的办法进行聚类处理,大幅减小了算法对内存的占用率。实验仿真结果表明,所提出的并行计算方法能够有效缓解内存不足的问题,并且该方法也能够用来评价DBSCAN聚类算法在Hadoop平台下的聚类分析效果,还能对两种聚类方法进行对比分析,从而获得较好的计算性能;且其比在Hadoop平台上的计算加速度提高了24%左右,因此可以用以评价DBSCAN聚类算法在聚类处理方面的优劣。

关键词: DBSCAN, Spark, 并行计算, 聚类加速比, 聚类算法

Abstract: Aiming at the problem of high memory occupancy of DBSCAN(Density-Based Spatial Clustering of Applications with Noise) clustering algorithm,this paper combines the improved DBSCAN clustering algorithm with the parallel clustering calculation theory of Spark platform,and the clustering and processing methods for massive data are clustered,which greatly reduces the memory usage of the algorithm.The experimental simulation results show that the proposed parallel computing method can effectively reduce the shortage of memory,and it also can be used to evaluate the clustering effect of the DBSCAN clustering algorithm on the Hadoop platform,and compare and analyze the twoclustering methods to obtain better computing performance.Besides,the acceleration is increased by about 24% compared with that on the Hadoop platform.The proposed method can be used to evaluate the pros and cons of the DBSCAN clustering algorithm in clustering.

Key words: Clustering acceleration ratio, Clustering algorithm, DBSCAN, Parallel computing, Spark

中图分类号: 

  • TP391
[1] AFSAR M,TAYARANI-N M H,AZIZ M.An adaptive competition-based clustering approach for wireless sensor networks[J].Telecommunication Systems,2016,61(1):181-204.
[2] ZHAO W,XIA G S,GOU Z J,et al.An Improved DBSCAN Algorithm[J].Journal of Sichuan Normal University (Natural Science Edition),2013(2):114-116.
[3] WANG L,WU L L,FU D M.A Density-Dased Fuzzy Adaptive Clustering Algorithm[J].Journal of University of Science and Technology Beijing,2014(11):312-316.
[4] ZHANG Q,WANG X,WANG X.An OPTICS Clustering-Based Anomalous Data Filtering Algorithm for Condition Monitoring of Power Equipment[C]//Revised Selected Papers of the Third Ecml Pkdd Workshop on Data Analytics for Renewable Energy Integration.Springer-Verlag New York,Inc.,2015:123-134.
[5] HUANG H,YOO S,YU D,et al.Density-Aware ClusteringBased on Aggregated Heat Kernel and Its Transformation[J].Acm Transactions on Knowledge Discovery from Data,2015,9(4):29-32.
[6] YE X,SAKURAI T.Spectral clustering using robust similarity measure based on closeness of shared Nearest Neighbors[C]//International Joint Conference on Neural Networks.IEEE,2015:1-8.
[7] SINGH G,KAUR J,MULGE Y.Performance evaluation of enhanced hierarchical and partitioning based clustering algorithm (EPBCA) in data mining[C]//International Conference on Applied and Theoretical Computing and Communication Technology.IEEE,2015:805-810.
[8] WANG B,ZHANG C,SONG L,et al.Design and optimization of DBSCAN Algorithm based on CUDA[J].Computer Science,2015,40(5):553-556.
[9] LENG Y,CHEN Z,ZHONG F,et al.BRDPHHC:A Balance RDF Data Partitioning Algorithm Based on Hybrid Hierarchical Clustering[C]//IEEE,International Conference on High PERFORMANCE Computing and Communications.IEEE,2015:1755-1760.
[10] LIN W H,TAN X J,LIU F J,et al.A new directional query method for polygon dataset in spatial database[J].Earth Science Informatics,2015,8(4):775-786.
[11] EZUGWU A E,FRINCU M E,JUNAIDU S B.A Multiagent-Based Approach to Scheduling of Multi-component Applications in Distributed Systems[J].Advances in Intelligent Systems and Computing,2015,347:1-12.
[12] ZHANG J,YOU S,GRUENWALD L.Large-scale spatial data processing on GPUs and GPU-accelerated clusters[J].Sigspatial Special,2015,6(3):27-34.
[13] YANG J.From Google File System to Omega:A Decade of Advancement in Big Data Management at Google[C]//IEEE First International Conference on Big Data Computing Service and Applications.IEEE,2015:249-255.
[14] WANG Y,ZHANG L,TAN J,et al.HydraDB:a resilient RDMA-driven key-value middleware for in-memory cluster computing[C]//2015 SC-International Conference for High Perfor-mance Computing,Networking,Storage and Analysis.IEEE,2017:22.
[15] WANG X,WU Y,JIANG X H,et al.Incremental Parallel Fast Clustering Based on DBSCAN Algorithm Under Large-scale Data Sets[J].Computer Applications and Software,2018,35(4):269-280.
[16] APILETTI D,GARZA P,PULVIRENTI F.A Review of Scalable Approaches for Frequent Itemset Mining[C]//East European Conference on Advances in Databases and Information Systems.Springer International Publishing,2015:243-247.
[17] WANG B,ZHANG C,SONG L,et al.Design and optimization of DBSCAN Algorithm based on CUDA[J].Computer Science,2015,40(5):553-556.
[18] CHEN R,ZHANG Y,ZHANG J,et al.Design and Optimizations of the MD5 Crypt Cracking Algorithm Based on CUDA[C]//International Conference on Cloud Computing.Springer International Publishing,2014:155-164.
[19] NING J F.DBSCAN Text Clustering Algorithm Based on Spark Framework[J].Journal of Shantou University (Natural Science Edition),2018,33(2):73-80.
[20] PENG X Y,YANG Y B,WANG C D,et al.An Efficient Parallel Nonlinear Clustering Algorithm Using MapReduce[C]//2016 IEEE International Parallel and Distributed Processing Sympo-sium Workshops.IEEE,2016:1473-1476.
[1] 柴慧敏, 张勇, 方敏.
基于特征相似度聚类的空中目标分群方法
Aerial Target Grouping Method Based on Feature Similarity Clustering
计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[2] 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香.
面向国产异构众核架构的CFD非结构网格计算并行优化方法
Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture
计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[3] 张亚迪, 孙悦, 刘锋, 朱二周.
结合密度参数与中心替换的改进K-means算法及新聚类有效性指标研究
Study on Density Parameter and Center-Replacement Combined K-means and New Clustering Validity Index
计算机科学, 2022, 49(1): 121-132. https://doi.org/10.11896/jsjkx.201100148
[4] 戴宏亮, 钟国金, 游志铭, 戴宏明.
基于Spark的舆情情感大数据分析集成方法
Public Opinion Sentiment Big Data Analysis Ensemble Method Based on Spark
计算机科学, 2021, 48(9): 118-124. https://doi.org/10.11896/jsjkx.210400280
[5] 张仁杰, 陈伟, 杭梦鑫, 吴礼发.
基于变分自编码器的不平衡样本异常流量检测
Detection of Abnormal Flow of Imbalanced Samples Based on Variational Autoencoder
计算机科学, 2021, 48(7): 62-69. https://doi.org/10.11896/jsjkx.200600022
[6] 俞建业, 戚湧, 王宝茁.
基于Spark的车联网分布式组合深度学习入侵检测方法
Distributed Combination Deep Learning Intrusion Detection Method for Internet of Vehicles Based on Spark
计算机科学, 2021, 48(6A): 518-523. https://doi.org/10.11896/jsjkx.200700129
[7] 李杉, 许新征.
基于双角度并行剪枝的VGG16优化方法
Parallel Pruning from Two Aspects for VGG16 Optimization
计算机科学, 2021, 48(6): 227-233. https://doi.org/10.11896/jsjkx.200800016
[8] 傅天豪, 田鸿运, 金煜阳, 杨章, 翟季冬, 武林平, 徐小文.
一种面向构件化并行应用程序的性能骨架分析方法
Performance Skeleton Analysis Method Towards Component-based Parallel Applications
计算机科学, 2021, 48(6): 1-9. https://doi.org/10.11896/jsjkx.201200115
[9] 何亚茹, 庞建民, 徐金龙, 朱雨, 陶小涵.
基于神威平台的Floyd并行算法的实现和优化
Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform
计算机科学, 2021, 48(6): 34-40. https://doi.org/10.11896/jsjkx.201100051
[10] 冯凯, 马鑫玉.
(n,k)-冒泡排序网络的子网络可靠性
Subnetwork Reliability of (n,k)-bubble-sort Networks
计算机科学, 2021, 48(4): 43-48. https://doi.org/10.11896/jsjkx.201100139
[11] 汤鑫瑶, 张正军, 储杰, 严涛.
基于自然最近邻的密度峰值聚类算法
Density Peaks Clustering Algorithm Based on Natural Nearest Neighbor
计算机科学, 2021, 48(3): 151-157. https://doi.org/10.11896/jsjkx.200100112
[12] 胡蓉, 阳王东, 王昊天, 罗辉章, 李肯立.
基于GPU加速的并行WMD算法
Parallel WMD Algorithm Based on GPU Acceleration
计算机科学, 2021, 48(12): 24-28. https://doi.org/10.11896/jsjkx.210600213
[13] 王茂光, 杨行.
一种基于AP-Entropy选择集成的风控模型和算法
Risk Control Model and Algorithm Based on AP-Entropy Selection Ensemble
计算机科学, 2021, 48(11A): 71-76. https://doi.org/10.11896/jsjkx.210200110
[14] 王卫东, 徐金慧, 张志峰, 杨习贝.
基于密度峰值聚类的高斯混合模型算法
Gaussian Mixture Models Algorithm Based on Density Peaks Clustering
计算机科学, 2021, 48(10): 191-196. https://doi.org/10.11896/jsjkx.200800191
[15] 张煜, 陆亿红, 黄德才.
基于密度峰值的加权犹豫模糊聚类算法
Weighted Hesitant Fuzzy Clustering Based on Density Peaks
计算机科学, 2021, 48(1): 145-151. https://doi.org/10.11896/jsjkx.200400043
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!