计算机科学 ›› 2024, Vol. 51 ›› Issue (8): 97-105.doi: 10.11896/jsjkx.230500226

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于共享最近邻的自适应密度峰值聚类算法

王心耕1, 杜韬1,2, 周劲1,2, 陈迪1, 仵匀政1   

  1. 1 济南大学信息科学与工程学院 济南 250024
    2 山东省网络环境智能计算技术重点实验室 济南 250024
  • 收稿日期:2023-05-31 修回日期:2023-10-18 出版日期:2024-08-15 发布日期:2024-08-13
  • 通讯作者: 杜韬(ise_dut@ujn.edu.cn)
  • 作者简介:(1054493071@qq.com)
  • 基金资助:
    国家自然科学基金(62273164);山东省自然科学基金联合基金(ZR2020LZH009)

Adaptive Density Peak Clustering Algorithm Based on Shared Nearest Neighbor

WANG Xingeng1, DU Tao1,2, ZHOU Jin1,2, CHEN Di1, WU Yunzheng1   

  1. 1 College of Information Science and Engineering,University of Jinan,Jinan 250024,China
    2 Shandong Provincial Key Laboratory of Network Based Intelligent Computing,Jinan 250024,China
  • Received:2023-05-31 Revised:2023-10-18 Online:2024-08-15 Published:2024-08-13
  • About author:WANG Xingeng,born in 1999,postgra-duate.His main research interests include data clustering and data mining.
    DU Tao,born in 1979,Ph.D,associate professor.His main research interests include data clustering and data mining.
  • Supported by:
    National Natural Science Foundation of China(62273164) and Joint Fund of Natural Science Foundation of Shandong Province,China(ZR2020LZH009).

摘要: 密度峰值聚类算法(DPC)是一种简单高效的无监督聚类算法,该算法虽能自动发现簇中心,实现任意形状数据的高效聚类,但依然存在一些缺陷。针对密度峰值聚类算法在定义相关度量值时未考虑数据的位置信息、聚类中心数目需要人工预先设定且分配样本点时易出现连锁反应这3个缺陷,提出一种基于共享最近邻的自适应密度峰值聚类算法。首先,利用共享最近邻重新定义局部密度等度量值,充分考虑了数据分布的局部特点,使样本点的空间分布特征得以更好地体现;其次,通过引入密度衰减现象让样本点自动聚集成微簇,实现了簇个数自适应确定和簇中心自适应选取;最后,提出一种两阶段的分配方法,先将微簇合并形成簇的主干部分,再用上一步分配好的簇主干指导剩余点的分配,避免了链式反应的发生。在二维合成数据集以及UCI数据集上的实现表明,相较于经典的密度峰值聚类算法及近年来对其提出的改进算法,在大多数情况下,所提算法表现出更优异的性能。

关键词: 共享最近邻, 密度峰值聚类, 分配策略, 聚类中心, 密度衰减

Abstract: Density peak clustering algorithm(DPC) is a simple and efficient unsupervised clustering algorithm.Although the algorithm can automatically discover cluster centers and realize efficient clustering of arbitrary shape data,it still has some defects.Aiming at the three defects of density peak clustering algorithm,which does not consider the location information of data when defining the correlation value,the number of clustering centers needs to be set manually in advance,and the chain reaction is easy to occur when distributing sample points,an adaptive density peak clustering algorithm based on shared nearest neighbor is proposed.Firstly,the shared nearest neighbor is used to redefine the local density and other measures,and the local characteristics of data distribution are fully considered,so that the spatial distribution characteristics of sample points can be better reflected.Se-condly,by introducing the phenomenon of density attenuation,the sample points are automatically gathered into micro-clusters,which realizes the adaptive determination of cluster number and the adaptive selection of cluster center.Finally,a two-stage distribution method is proposed,in which the micro-clusters are merged to form the backbone of the cluster,and then the backbone of the cluster allocated in the previous step guides the distribution of the remaining points,avoiding the occurrence of chain reactions.The implementation on two dimensional composite datasets and UCI datasets shows that this algorithm has better perfor-mance in most cases than the classical density peak clustering algorithm and its improved algorithms in recent years.

Key words: Shared nearest neighbor, Density peak clustering, Allocation strategy, Cluster center, Density decay

中图分类号: 

  • TP391
[1]SUN L,QIN X Y,XU J C,et al.Density Peak Clustering Algorithm Based on K-Nearest Neighbors and Optimal Assignment Strategy[J].Journal of Software,2022,33(4):1390-141.
[2]CERQUITELLI T,VENTURA F,APILETTI D,et al.Enhancing manufacturing intelligence through an unsupervised data-driven methodology for cyclic industrial processes[J].Expert Systems with Applications,2021,182(3):115269.
[3]YANG L,CHEUNG Y M,YUAN Y T.Self-Adaptive Multiprototype-Based Competitive Learning Approach:Ak-Means-Type Algorithm for Imbalanced Data Clustering[J].IEEE Transactions on Cybernetics,2019,51(3):1598-1612.
[4]AHMAD A,KHAN S S.initKmix-A novel initial partitiongeneration algorithm for clustering mixed data using k-means-based clustering[J].Expert Systems with Applications,2020,167(2):114149.
[5]JIANG J T,ZHENG C H.Density Peak and Grid Based Clustering for Large-scale Node Partition[J].Journal of Chinese Mini-Micro Computer Systems,2022,43(3):498-505.
[6]SUN L,LIANG Y Q.Improved Clustering Algorithm FusingGrid Partition and DBSCAN[J].Computer Engineering and Applications,2022,58(14):73-79.
[7]HU C A,WANG J X,MAO Y M.Density-based clustering algorithm based on groups and improve gravitational search[J].Application Research of Computers,2021,38(11):3293-3299.
[8]GUO W J,WANG W H,ZHAO S P,et al.Density Peak Clustering with connectivity estimation[J].Knowledge-Based Systems,2022,243(5):108501.
[9]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases[J].ACM Sigmod Record,1996,25(2):103-114.
[10]WANG R,ZHOU J,JIANG H,et al.A general transfer lear-ning-based Gaussian mixture modelfor clustering[J].Interna-tional Journal of Fuzzy Systems,2021,23(3):776-793.
[11]LI K,ZHANG K X.Structural α-Entropy Weighting Gaussian Mixture Model for Subspace Clustering[J].Chinese Journal of Electronics,2022,50(3):718-725.
[12]HARTIGAN J A,WONG M A.Algorithm AS 136:A K-Means Clustering Algorithm[J].Journal of the Royal Statistical Society,1979,28(1):100-108.
[13]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]//Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.1996:226-231.
[14]RODRIGUEZ A,LAIO A.Clustering by fast search and find of density peaks[J].Science,2014,344(6191):1492-1496.
[15]WEI L,GAO L,LI J H,et al.Traffic sub-area division method based on density peak clustering[J].Journal of Jilin University(Engineering and Technology Edition),2023,53(1):124-131.
[16]WANG F Y,ZHANG D S,XIAO Y T.Density Peak Algorithm Based on Weighted Shared Nearest Neighbor and Accumulated Sequence[J].Computer Engineering,2022,48(4):61-69.
[17]ZHANG X Y,YUN W G.Sharing K-nearest Neighbors andMultiple Assignment PoliciesDensity Peaks Clustering Algorithm[J].Journal of Chinese Computer Systems,2023,44(1):75-82.
[18]DU M,DING S,JIA H.Study on density peaks clustering based on k-nearest neighbors and principal component analysis[J].Knowledge-Based Systems,2016,99(5):135-145.
[19]JIANG J,CHEN Y,MENG X,et al.A novel density peaks clustering algorithm based on k nearest neighbors for improving assignment process[J].Physica A:Statistical Mechanics and Its Applications,2019,523(6):702-713.
[20]LIU R,WANG H,YU X.Shared-nearest-neighbor-based clustering by fast search and find of density peaks[J].Information Sciences,2018,450:200-226.
[21]CHEN J,YU P.A domain adaptive density clustering algorithm for data with varying density distribution[J].IEEE Transactions on Knowledge and Data Engineering,2021,33(6):2310-2321.
[22]LIU Y H,MA Z M,YU F.Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy[J].Knowledge-Based Systems,2017,133(10):208-220.
[23]ZHANG Z,ZHU Q,ZHU F,et al.Density decay graph-based density peak clustering[J].Knowledge-Based Systems,2021,224:107075.
[24]XIE J,GAO H,XIE W,et al.Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors[J].Information Sciences,2016,354(8):19-40.
[25]SUN L,QIN X,DING W,et al.Nearest neighbors-based adaptive density peaks clustering with optimized allocation strategy[J].Neurocomputing,2022,473:159-181.
[26]GUO W,WANG W,ZHAO S,et al.Density peak clusteringwith connectivity estimation[J].Knowledge-Based Systems,2022,243:108501.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!