计算机科学 ›› 2021, Vol. 48 ›› Issue (3): 151-157.doi: 10.11896/jsjkx.200100112

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于自然最近邻的密度峰值聚类算法

汤鑫瑶, 张正军, 储杰, 严涛   

  1. 南京理工大学理学院 南京210094
  • 收稿日期:2020-01-07 修回日期:2020-06-09 出版日期:2021-03-15 发布日期:2021-03-05
  • 通讯作者: 张正军(zzjnj@163.com)
  • 作者简介:1015562103@qq.com
  • 基金资助:
    国家自然科学基金(11671205)

Density Peaks Clustering Algorithm Based on Natural Nearest Neighbor

TANG Xin-yao, ZHANG Zheng-jun, CHU Jie, YAN Tao   

  1. School of Science,Nanjing University of Science and Technology,Nanjing 210094,China
  • Received:2020-01-07 Revised:2020-06-09 Online:2021-03-15 Published:2021-03-05
  • About author:TANG Xin-yao,born in 1996,postgra-duate.His main research interests include data mining and mechine lear-ning.
    ZHANG Zheng-jun,born in 1965,Ph.D,associate professor.His main research interests include data mining and gra-phic image.
  • Supported by:
    National Natural Science Foundation of China(11671205).

摘要: 针对密度峰值聚类算法(Density Peaks Clustering,DPC)需要人为指定截断距离dc,以及局部密度定义简单和一步分配策略导致算法在复杂数据集上表现不佳的问题,提出了一种基于自然最近邻的密度峰值聚类算法(Density Peaks Clustering based on Natural Nearest Neighbor,NNN-DPC)。该算法无需指定任何参数,是一种非参数的聚类方法。该算法首先根据自然最近邻的定义,给出新的局部密度计算方法来描述数据的分布,揭示内在的联系;然后设计了两步分配策略来进行样本点的划分。最后定义了簇间相似度并提出了新的簇合并规则进行簇的合并,从而得到最终聚类结果。实验结果表明,在无需参数的情况下,NNN-DPC算法在各类数据集上都有优秀的泛化能力,对于流形数据或簇间密度差异大的数据能更加准确地识别聚类数目和分配样本点。与DPC、FKNN-DPC(Fuzzy Weighted K-nearest Density Peak Clustering)以及其他3种经典聚类算法的性能指标相比,NNN-DPC算法更具优势。

关键词: 局部密度, 聚类算法, 密度峰值, 自然最近邻居

Abstract: Aiming at the problem that the density peak clustering (DPC) algorithm requires manually selected parameters (cutoff distance dc),as well as the problem of a poor performance on complex data sets caused by the simple definition of local density and the one-step assignment strategy,a new density peak clustering algorithm based on natural nearest neighbors (NNN-DPC) is proposed.The algorithm does not need to specify any parameters and is a non-parametric clustering method.Based on the definition of natural nearest neighbors,this algorithm firstly gives a new local density calculation formula to describe the distribution of data,and reveals the internal connection.A two-step assignment strategy is designed to divide the sample points.Finally,the similarity between clusters is defined,and a new cluster merging rule is proposed to merge the clusters to obtain the final clustering result.The experimental results show that without parameters,the NNN-DPC algorithm has excellent generalization ability on various types of data sets,and can more accurately identify the number and distribution of clusters on manifold data or data with large differences of density between clusters,and assign sample points to the corresponding clusters.Compared with the perfor-mance indicators of DPC,FKNN-DPC,and three other classic clustering algorithms,the NNN-DPC algorithm has a great advantage.

Key words: Clustering algorithm, Density peaks, Local density, Natural nearest neighbor

中图分类号: 

  • TP301.6
[1]MACQUEEN J.Some Methods for Classificati-on and Analysis of MultiVariate Observations[C]//Proc of Berkeley Sympo-sium on Mathematical Statistics & Probability.1967,1(14):281-297.
[2]KAUFMAN L,ROUSSEEUW P.Clustering by means of medoids[C]//Statistical Data Analysis Based on the L1-norm & Related Methods.North-Holland,1987:405-416.
[3]FREY B J,DUECK D.Clustering by Passing Messages Between Data Points [J].Science,2007,315(5814):972-976.
[4]ESTER M,KRIEGEL H P,SANDER J,et al.A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise[C]//Internation-al Conference on Knowledge Discovery & Data Mining.1996:226-231.
[5]RODRIGUEZ A,LAIO A.Clustering by fast search and find of density peaks [J].Science,2014,344(6191):1492-1496.
[6]YANG Z,WANG H J,ZHOU Y.A clusteri-ng algorithm based on cutoff distance and adaptive cluster center [J].Data Analysis and Knowledge Discovery,2018,2(3):39-48.
[7]WANG Y,ZHANG G Z.Density peak algorithm by automatically determining cluster centers [J].Computer Engineering and Applications,2018,54(8):137-142.
[8]LIU R,WANG H,YU X.Shared-nearest-neighbor-based clustering by fast search and find of density peaks[J].Information Sciences,2018,450:200-226.
[9]DU M J,DING S F,JIA H J.Study on density peaks clustering based on k-nearest neighbors and principal component analysis[J].Knowledge-Based Systems,2016,99:135-145.
[10]XIE J Y,GAO H C,XIE W X,et al.Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors[J].Information Sciences,2016,345:19-40.
[11]YANG L J,ZHU Q S,HUANG J L,et al.Adap-tive edited na-tural neighbor algorithm[J].Neurocomputing,2017,230:427-433.
[12]HUANG J L,ZHU Q S,YANG L J,et al.QCC:a novel clustering algorithm based on quasicluster centers[J].Machine Lear-ning,2017,106(3):337-357.
[13]ZHU Q S,FENG J,HUANG J L.Natural neighbor:a self adaptive neighborhood method without parameter K[J].Pattern Recognition Letters,2016,80:30-36.
[14]STEVENS S S.Mathematics,measurement,and psychophysics[C]//Handbook of Experimental Psychology.London:Wiley,1951:1-49.
[15]ZOU X L.Application of Natural Nearest Neighbor in High-dimensional Data Structure Learning[D].Chongqing:Chongqing University,2011.
[16]HUANG J L.Research on Parameterless Clustering Algorithm Based on Natural Nearest Neighbor[D].Chongqing:Chongqing University,2014.
[17]VINH N X,EPPS J,BAILEY J.Information theoretic measures for clusterings comparison:Variants,properties,normalization and correction for chance[J].Journal of Machine Learning Research,2010,11(Oct):2837-2854.
[18]FOWLKES E B,MALLOWS C L.A method for comparing two hierarchical clusterings[J].Journal of the American Statistical Association,1983,78(383):553-569.
[19]PEDREGOSA F,VAROQUAUX G,GRAMFORT A,et al.Scikit-learn:Machine learning in Python[J].Journal of Machine Learning Research,2011,12(Oct):2825-2830.
[20]CHANG H,YEUNG D Y.Robust path-based spectral cluste-ring[J].Pattern Recognition,2008,41(1):191-203.
[21]GIONIS A,MANNILA H,TSAPARAS P.Clusteri-ng aggregation[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2007,1(1):4.
[22]VEENMAN C J,REINDERS M J T,BACKER E.A maximum variance cluster algorithm[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2002,24(9):1273-1280.
[23]FRANTI P,VIRMAJOKI O,HAUTAMAKI V.Fast agglomera-tive clustering using a k-nearest neighbor graph[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2006,28(11):1875-1881.
[24]BACHE K,LICHMAN M.UCI machine learning repository[EB/OL].[2019-12-20].http://archive.ics.uci.edu/ml.
[1] 柴慧敏, 张勇, 方敏.
基于特征相似度聚类的空中目标分群方法
Aerial Target Grouping Method Based on Feature Similarity Clustering
计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[2] 张亚迪, 孙悦, 刘锋, 朱二周.
结合密度参数与中心替换的改进K-means算法及新聚类有效性指标研究
Study on Density Parameter and Center-Replacement Combined K-means and New Clustering Validity Index
计算机科学, 2022, 49(1): 121-132. https://doi.org/10.11896/jsjkx.201100148
[3] 李杉, 许新征.
基于双角度并行剪枝的VGG16优化方法
Parallel Pruning from Two Aspects for VGG16 Optimization
计算机科学, 2021, 48(6): 227-233. https://doi.org/10.11896/jsjkx.200800016
[4] 王茂光, 杨行.
一种基于AP-Entropy选择集成的风控模型和算法
Risk Control Model and Algorithm Based on AP-Entropy Selection Ensemble
计算机科学, 2021, 48(11A): 71-76. https://doi.org/10.11896/jsjkx.210200110
[5] 王卫东, 徐金慧, 张志峰, 杨习贝.
基于密度峰值聚类的高斯混合模型算法
Gaussian Mixture Models Algorithm Based on Density Peaks Clustering
计算机科学, 2021, 48(10): 191-196. https://doi.org/10.11896/jsjkx.200800191
[6] 张煜, 陆亿红, 黄德才.
基于密度峰值的加权犹豫模糊聚类算法
Weighted Hesitant Fuzzy Clustering Based on Density Peaks
计算机科学, 2021, 48(1): 145-151. https://doi.org/10.11896/jsjkx.200400043
[7] 徐守坤, 倪楚涵, 吉晨晨, 李宁.
基于YOLOv3的施工场景安全帽佩戴的图像描述
Image Caption of Safety Helmets Wearing in Construction Scene Based on YOLOv3
计算机科学, 2020, 47(8): 233-240. https://doi.org/10.11896/jsjkx.190600109
[8] 张琴, 陈红梅, 封云飞.
一种基于粗糙集和密度峰值的重叠社区发现方法
Overlapping Community Detection Method Based on Rough Sets and Density Peaks
计算机科学, 2020, 47(5): 72-78. https://doi.org/10.11896/jsjkx.190400160
[9] 陈俊芬,张明,赵佳成.
复杂高维数据的密度峰值快速搜索聚类算法
Clustering Algorithm by Fast Search and Find of Density Peaks for Complex High-dimensional Data
计算机科学, 2020, 47(3): 79-86. https://doi.org/10.11896/jsjkx.190400123
[10] 田献珍, 孙立强, 田振中.
基于蚁群算法的图像重建
Image Reconstruction Based on Ant Colony Algorithm
计算机科学, 2020, 47(11A): 231-235. https://doi.org/10.11896/jsjkx.191000128
[11] 邓定胜.
一种改进的DBSCAN算法在Spark平台上的应用
Application of Improved DBSCAN Algorithm on Spark Platform
计算机科学, 2020, 47(11A): 425-429. https://doi.org/10.11896/jsjkx.190700071
[12] 李晓光, 邵超.
基于网格数据中心的密度峰值聚类算法
Density Peak Clustering Algorithm Based on Grid Data Center
计算机科学, 2019, 46(6A): 457-460.
[13] 张建新, 刘弘, 李焱.
一种面向人群疏散的高效分组方法
Efficient Grouping Method for Crowd Evacuation
计算机科学, 2019, 46(6): 231-238. https://doi.org/10.11896/j.issn.1002-137X.2019.06.035
[14] 李长镜,赵书良,池云仙.
一种基于谱嵌入和局部密度的离群点检测算法
Outlier Detection Algorithm Based on Spectral Embedding and Local Density
计算机科学, 2019, 46(3): 260-266. https://doi.org/10.11896/j.issn.1002-137X.2019.03.039
[15] 胡闯, 杨庚, 白云璐.
面向差分隐私保护的聚类算法
Clustering Algorithm in Differential Privacy Preserving
计算机科学, 2019, 46(2): 120-126. https://doi.org/10.11896/j.issn.1002-137X.2019.02.019
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!