基于影响空间的稳健密度峰值聚类算法

doi:10.11896/jsjkx.181001846

摘要/Abstract

摘要： DP(Clustering by Fast Search and Find of Density Peaks)是一种新提出的基于局部密度和距离的聚类算法,具有能够发现任意形状的类簇、易于理解并且可以高效划分数据的优点。但是该算法无法处理单个类簇中同时存在多个密度峰值的情况,并且数据划分不稳定,容易导致连锁错误划分;当类簇间的密度差异较大时,其无法准确识别稀疏的类簇。为弥补以上不足,提出一种基于影响空间的稳健密度峰值聚类算法。该改进算法通过邻近数据计算局部密度,增强对小规模类簇的识别能力。为了提高数据划分的稳定性,引入了影响空间,并定义了一种新的对称关系,提出了一种新的分配策略。其通过计算目标数据与邻近数据的局部密度比值,并对影响空间进行加权,使算法能够处理具有多密度分布特征的数据。基于人工合成数据集和UCI数据集的模拟对比实验表明,提出的改进策略增强了算法对稀疏类簇的识别能力,提高了数据划分的稳定性,在NMI和Acc评价指标方面取得了较优的结果。

关键词: 分配策略, 局部密度, 聚类算法, 稳定性, 影响空间

Abstract: DP (Clustering by Fast Search and Find of Density Peaks) is a novel clustering algorithm based on local density and distance between data,which has the advantages of being able to discover the clusters of arbitrary shapes,easy to accept logic principle and efficiently partitioning data into data sets.However,this algorithm has some disadvantages,for instance,it can’t deal with the case when multiple density peaks co-exist in a single cluster,and it is characterized with the unstable class label allocation.At the same time,it can’t accurately identify the clusters of sparse data when the density difference between clusters is big enough.In order to overcome the above deficiencies,this paper proposed a robust density peak clustering algorithm based on influence space.The proposed algorithm is capable of calculating local density by neighboring data and enhances the ability to recognize small-scale clusters.In order to improve the robustness of data partition,the data influence space is introduced,a new symmetric relationship is defined,and a new allocation strategy is proposed.By calculating the weighted density ratio of the target data to the adjacent data,the algorithm is able to process data sets with multiple density distribution features.The local density is obtained by calculating the density ratio of the target data to the neighbor data,and the ratio is weighted through using the size of the influence space.In order to verify the availability of the proposed algorithm,simulation experiments were conducted on synthetic data sets and UCI data sets,and NMI and Acc indexs were used to evaluate clustering algorithms.Simulation experiment results show that the proposed algorithm can reduce the influence of the cutoff distance on the algorithm and classify the data more stably.Compared with other algorithms,the proposed algorithm achieves better experimental results on both NMI and Acc indexs.

Key words: Allocation strategy, Clustering algorithm, Influence space, Local density, Robustness

中图分类号:

TP391

陈春涛, 陈优广. 基于影响空间的稳健密度峰值聚类算法[J]. 计算机科学, 2019, 46(11): 216-221. https://doi.org/10.11896/jsjkx.181001846

CHEN Chun-tao, CHEN You-guang. Influence Space Based Robust Fast Search and Density Peak Clustering Algorithm[J]. Computer Science, 2019, 46(11): 216-221. https://doi.org/10.11896/jsjkx.181001846

参考文献

[1]MACQUEEN J.Some Methods for Classification and Analysisof MultiVariate Observations[C]∥Berkeley Symposium on Mathematical Statistics and Probability.Berkeley:University of California Press,1966:281-297.
[2]KARYPIS G,HAN E H,KUMAR V.CHAMELEON.A hierarchical clustering algorithm using dynamic modeling[J].Computer,1999,32(8):68-75.
[3]ESTER M,KRIEGEL H P,XU X.A Density-based Algorithmfor Discovering Clusters in Large Spatial Databases with Noise[C]∥International Conference on Knowledge Discovery and Data Mining.Palo Alto:AAAI Press,1996:226-231.
[4]WANG W,YANG J,MUNTZ R R.STING:A Statis-tical Information Grid Approach to Spatial Data Mining[C]∥Interna-tional Conference onVery Large Data Bases.San Mateo:Morgan Kaufmann.1997:186-195.
[5]HU Q,WU J,BAI L,et al.Fast K-means for Large Scale Clustering[C]∥Conference on Information and Knowledge Management.New York:ACM,2017:2099-2102.
[6]PARK H S,JUN C H.A Simple and Fast Algo-rithm for K-medoids Clustering[J].Expert Systems with Applications,2009,36(2):3336-3341.
[7]HE Y,TAN H,LUO W,et al.MR-DBSCAN:An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce[C]∥International Conference on Parallel and Distributed Systems.New York:IEEE,2012:473-480.
[8]JAHIRABADKAR S,KULKARNI P.Algorithm to Determine $\ell$-distance Parameter in Density based Clustering[J].Expert Systems with Ap-plications,2014,41(6):2939-2946.
[9]NANDA S J,PANDA G.Design of Computationally Efficient Density-based Clustering Algorithms[J].Data & Knowledge Engineering,2015,95:23-38.
[10]BAI X,YANG P,SHI X.An Overlapping Community Detection Algorithm Based on Density Peaks[J].Neurocomputing,2016,226(22):7-15.
[11]WANG H Y,ZHANG T F,MA F M.Rough K-means Algorithm with Self-adaptive Weights Measurement Based on Space Distance[J].Computer Science,2018,45(7):190-196.(in Chinese)
王慧研,张腾飞,马福民.基于空间距离自适应权重度量的粗糙K-means算法[J].计算机科学,2018,45(7):190-196.
[12]RODRIGUEZ A,LAIO A.Machine Learning.Clustering byFast Search and Find of Density Peaks[J].Science,2014,344(6191):1492-1496.
[13]ZHANG Y,XIA Y,LIU Y,et al.Clustering Sentences withDensity Peaks for Multidocument Summarization[C]∥Confe-rence of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2015:1262-1267.
[14]ZHANG Y,CHEN S.Efficient distributed density peaks forclustering large data sets in mapreduce[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(12):3218-3230.
[15]XU J,WANG G,LI T,et al.Fat node leading tree for datastream clustering with density peaks[J].Knowledge-Based Systems,2017,120:99-117.
[16]CHEN Y,MA S,CHEN X,et al.Hyperspectral data clustering based on density analysis-ensemble[J].Remote Sensing Letters,2017,8(2):194-203.
[17]LI Z J,TANG Y C.Comparative density Peaks Clustering[J].Expert Systemswith Applications,2018,95:236-247.
[18]DING J,CHEN Z,HE X,et al.Clustering by Finding Density Peaks Based on Chebyshev’s inequality[C]∥ControlConfe-rence.New York:IEEE,2016:7169-7172.
[19]GAO S Y,ZHOU X F,LI S.Clustering by Fast Search and Find of Density Peaks Based on Density Ratio[J].Computer Engineering and Applications,2017,53(16):10-17.(in Chinese)
高诗莹,周晓锋,李帅.基于密度比例的密度峰值聚类算法[J].计算机工程与应用,2017,53(16):10-17.
[20]WANG S L,WANG D K,LI C Y,et al.Clustering by Fast Search and Find of Density Peaks with Data Field[J].Chinese Journal of Electronics,2016,25(3):397-402.
[21]LIU Y,MA Z,YU F.Adaptive Density Peak Clustering Based on K-nearest Neighbors with Aggregating Strategy[J].Know-ledge Based Systems,2017,133:208-220.
[22]JIN W,TUNG A K H,HAN J,et al.Ranking Outliers Using Symmetric Neighborhood Relationship[C]∥Pacific-Asia Conference on Knowledge Discovery and Data Mining.Berlin Heidelberg:Springer,2006:577-593.

相关文章 15

[1]	柴慧敏, 张勇, 方敏. 基于特征相似度聚类的空中目标分群方法 Aerial Target Grouping Method Based on Feature Similarity Clustering 计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[2]	郑文萍, 刘美麟, 杨贵. 一种基于节点稳定性和邻域相似性的社区发现算法 Community Detection Algorithm Based on Node Stability and Neighbor Similarity 计算机科学, 2022, 49(9): 83-91. https://doi.org/10.11896/jsjkx.220400146
[3]	张亚迪, 孙悦, 刘锋, 朱二周. 结合密度参数与中心替换的改进K-means算法及新聚类有效性指标研究 Study on Density Parameter and Center-Replacement Combined K-means and New Clustering Validity Index 计算机科学, 2022, 49(1): 121-132. https://doi.org/10.11896/jsjkx.201100148
[4]	李杉, 许新征. 基于双角度并行剪枝的VGG16优化方法 Parallel Pruning from Two Aspects for VGG16 Optimization 计算机科学, 2021, 48(6): 227-233. https://doi.org/10.11896/jsjkx.200800016
[5]	张久杰, 陈超, 聂宏轩, 夏玉芹, 张丽萍, 马占飞. 基于类粒度的克隆代码群稳定性实证研究 Empirical Study on Stability of Clone Code Sets Based on Class Granularity 计算机科学, 2021, 48(5): 75-85. https://doi.org/10.11896/jsjkx.200900062
[6]	汤鑫瑶, 张正军, 储杰, 严涛. 基于自然最近邻的密度峰值聚类算法 Density Peaks Clustering Algorithm Based on Natural Nearest Neighbor 计算机科学, 2021, 48(3): 151-157. https://doi.org/10.11896/jsjkx.200100112
[7]	王茂光, 杨行. 一种基于AP-Entropy选择集成的风控模型和算法 Risk Control Model and Algorithm Based on AP-Entropy Selection Ensemble 计算机科学, 2021, 48(11A): 71-76. https://doi.org/10.11896/jsjkx.210200110
[8]	王卫东, 徐金慧, 张志峰, 杨习贝. 基于密度峰值聚类的高斯混合模型算法 Gaussian Mixture Models Algorithm Based on Density Peaks Clustering 计算机科学, 2021, 48(10): 191-196. https://doi.org/10.11896/jsjkx.200800191
[9]	张煜, 陆亿红, 黄德才. 基于密度峰值的加权犹豫模糊聚类算法 Weighted Hesitant Fuzzy Clustering Based on Density Peaks 计算机科学, 2021, 48(1): 145-151. https://doi.org/10.11896/jsjkx.200400043
[10]	徐守坤, 倪楚涵, 吉晨晨, 李宁. 基于YOLOv3的施工场景安全帽佩戴的图像描述 Image Caption of Safety Helmets Wearing in Construction Scene Based on YOLOv3 计算机科学, 2020, 47(8): 233-240. https://doi.org/10.11896/jsjkx.190600109
[11]	王萌, 丁志军. 一种新的设备指纹特征选择及模型构建方法 New Device Fingerprint Feature Selection and Model Construction Method 计算机科学, 2020, 47(7): 257-262. https://doi.org/10.11896/jsjkx.190900107
[12]	朱林立, 华钢, 高炜. 决定图框架下本体学习算法的稳定性分析 Stability Analysis of Ontology Learning Algorithm in Decision Graph Setting 计算机科学, 2020, 47(5): 43-50. https://doi.org/10.11896/jsjkx.200100129
[13]	周畅,陆慧梅,向勇,吴竞邦. 区块链在车载自组网中的应用研究及展望 Survey on Application of Blockchain in VANET 计算机科学, 2020, 47(2): 213-220. https://doi.org/10.11896/jsjkx.190600001
[14]	邓定胜. 一种改进的DBSCAN算法在Spark平台上的应用 Application of Improved DBSCAN Algorithm on Spark Platform 计算机科学, 2020, 47(11A): 425-429. https://doi.org/10.11896/jsjkx.190700071
[15]	田献珍, 孙立强, 田振中. 基于蚁群算法的图像重建 Image Reconstruction Based on Ant Colony Algorithm 计算机科学, 2020, 47(11A): 231-235. https://doi.org/10.11896/jsjkx.191000128

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed