计算机科学 ›› 2024, Vol. 51 ›› Issue (7): 124-132.doi: 10.11896/jsjkx.231000125

• 数据库&大数据&数据科学 • 上一篇    下一篇

一种基于属性相似性和分布结构连通性的聚类算法

孙浩文, 丁家满, 李博文, 贾连印   

  1. 昆明理工大学信息工程与自动化学院 昆明 650500
    云南省人工智能重点实验室(昆明理工大学) 昆明 650500
  • 收稿日期:2023-10-18 修回日期:2024-03-29 出版日期:2024-07-15 发布日期:2024-07-10
  • 通讯作者: 丁家满(tjoman@126.com)
  • 作者简介:(shw_haowen@163.com)
  • 基金资助:
    国家自然科学基金(62262034,62262035);云南省科技揭榜项目(202204BW050001)

Clustering Algorithm Based on Attribute Similarity and Distributed Structure Connectivity

SUN Haowen, DING Jiaman, LI Bowen, JIA Lianyin   

  1. Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650500,China
    Artificial Intelligence Key Laboratory of Yunnan Province,Kunming University of Science and Technology,Kunming 650500,China
  • Received:2023-10-18 Revised:2024-03-29 Online:2024-07-15 Published:2024-07-10
  • About author:SUN Haowen,born in 1999,master,is a member of CCF(No.P9555G).His main research interests include machine learning and data mining.
    DING Jiaman,born in 1974,professor,is a member of CCF(No.77726M).His main research interests include data mining,cloud computing and machine learning.
  • Supported by:
    National Natural Science Foundation of China(62262034,62262035) andRanking the Top of the List for Science and Technology Project of Yunnan Province(202204BW050001).

摘要: 聚类分析针对不同的数据特点采用不同的相似性度量,现实世界中数据分布复杂,存在分布无规律、密度不均匀等现象,单独考虑实例属性相似性或分布结构连通性会影响聚类效果。为此,提出了一种基于属性相似性和分布结构连通性的聚类算法(A Clustering Algorithm Based on Attribute Similarity and Distributed Structure Connectivity,ASDSC)。首先,利用待聚类数据集中的所有数据实例构建完全无向图,定义了一种兼顾属性相似和分布结构连通的新颖相似性度量方式,用于计算节点相似性,并构造邻接矩阵更新边的权重;其次,借助邻接矩阵执行递增步长的随机游走,依据顶点的连通中心性来识别簇中心并给定簇编号,同时获取其他顶点的连通性;然后,利用连通性计算顶点间的依赖关系,并据此进行簇编号的传播,直至完成聚类。最后,为了验证该方法的聚类性能,在16个合成数据集和10个真实数据集上与5种先进聚类算法进行了对比实验,ASDSC算法取得了优异性能。

关键词: 聚类, 相似性度量, 属性相似性, 分布结构连通性, 簇编号传播

Abstract: According to different data characteristics,clustering analysis adopts different similarity measures.However,the data distribution is complex in the real world,and there are various phenomena such as irregular distribution and uneven density.Considering attribute similarity or distribution structure connectivity alone will reduce clustering performance.Therefore,this paper proposes a clustering algorithm based on attribute similarity and distributed structure connectivity(ASDSC).Firstly,a completely undirected graph is constructed using all data instances,and a novel similarity measurement method is defined to calculate the node similarity by the topology structure and the attributes similarity,and the adjacency matrix are constructed to update the weights of edges.Secondly,based on the adjacency matrix,random walk with increasing step is performed.Subsequently,the cluster centers and their numbers are obtained according to the connected centrality of nodes,and the connectivity of other nodes is also acquired.Then,the connectivity is used to calculate the dependencies among nodes,and the propagation process of cluster number is carried out accordingly until the clustering process is completed.Finally,comparative experiments with 5 advanced clustering algorithms are conducted on 16 synthetic datasets and 10 real datasets,and the result show that the ASDSC algorithm has achieved excellent performance.

Key words: Clustering, Similarity measure, Attribute similarity, Distributed structure connectivity, Cluster number propagation

中图分类号: 

  • TP391
[1]ZHANG Y,XIA Y Q,LIU Y,et al.Clustering sentences withdensity peaks for multi-document summarization[C]//Procee-dings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.DC:Association for Computational Linguistics,2015:1262-1267.
[2]LAHIRI S,MISRA S,SANJOY KUMAR S,et al.Clustering-Based Semi-supervised Technique for Credit Card Fraud Detection[C]//Proceedings of Computational Intelligence in Communications and Business Analytics.Springer,2022:260-268.
[3]LI C W,CHEN H M,LI T R,et al.A stable community detection approach for complex network based on density peak clustering and label propagation[J].Applied Intelligence,2022,52:1188-1208.
[4]INUWA-DUTSE I,LIPTROTT M,KORKONTZELOS I.Amultilevel clustering technique for community detection[J].Neurocomputing,2021,441:64-78.
[5]QU J L,CUI Y H.Gene set analysis with graph-embedded kernel association test[J].Bioinformatics,2022,38(6):1560-1567.
[6]GAO C M,ZHAO Y,WU R Z,et al.Semantic trajectory compression via multi-resolution synchronization-based clustering[J].Knowledge-Based Systems,2019,174:177-193.
[7]ZHAO K,CHONG P,QIANG C.Twin learning for similarityand clustering:a unified kernel approach[C]//Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence(AAAI'17).USA,2017:2080-2086.
[8]TIAN X Y,XU D C,DU D L,et al.The spherical k-means++ algorithm via local search scheme[J].Journal of Combinatorial Optimization,2022,44:2375-2394.
[9]HANAFI N,SAADATFAR H.A fast DBSCAN algorithm forbigdata based on efficient density calculation[J].Expert Systems with Applications,2022,203:117501.
[10]KUMAR U,LEGENDRE C P,LEE J C,et al.On analyzingGNSS displacement field variability of Taiwan:Hierarchical Agglomerative Clustering based on Dynamic Time Warping technique[J].Computers & Geosciences,2022,169:105243.
[11]LI H M,YE X C,IMAKURA A,et al.Divide-and-conquer based Large-Scale Spectral Clustering[J].Neurocomputing,2022,501:664-678.
[12]FRÄNTI P,SIERANOJA S.How much can k-means be im-proved by using better initialization and repeats?[J].Pattern Recognition,2019,93:95-112.
[13]GUPTA A,DATTA S,DAS S.Fast automatic estimation of the number of clusters from the minimum inter-center distance for k-means clustering[J].Pattern Recognition Letters,2018,116:72-79.
[14]OSKOUEI A G,BALAFAR M A,MOTAMED C.FK-MAWCW:Categorical fuzzy k-modes clustering with automated attribute-weight and cluster-weight learning[J].Chaos,Solitons &Fractals,2021,153(1):111494.
[15]YANG Y M,LIU H M,GUAN Z Y,et al.CoHomo:A cluster-attribute correlation aware graph clustering framework[J].Neurocomputing,2020,412:327-338.
[16]SCHUBERT E,SANDER J,ESTER M,et al.DBSCAN Revisi-ted,Revisited:Why and How You Should(Still)Use DBSCAN[J].ACM Transactions on Database Systems,2017,42(3):19-21.
[17]ABBAS M,EL-ZOGHABI A,SHOUKRY A.DenMune:Density peak based clustering using mutual nearest neighbors[J].Pattern Recognition,2021,109:107589.
[18]ZHU X T,LOY C C,GONG S G.Constructing Robust Affinity Graphs for Spectral Clustering[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition.OH:IEEE,2014:1450-1457.
[19]WANG Y,MA Y,HUANG H,et al.A split-merge clustering algorithm based on the k-nearest neighbor graph[J].Information Systems,2023,111:102124.
[20]CAI Y D,HUANG Z X,YIN J F.A new method to build theadaptive k-nearest neighbors similarity graph matrix for spectral clustering[J].Neurocomputing,2022,493:191-203.
[21]ULTSCH A,LÖTSCH J.The fundamental clustering and projection suite(fcps):a dataset collection to test the performance of clustering and data projection algorithms[J].Data,2020,5(1):13.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!