计算机科学 ›› 2023, Vol. 50 ›› Issue (1): 9-17.doi: 10.11896/jsjkx.211000185
王少将1, 刘佳2, 郑锋3, 潘祎诚3
WANG Shaojiang1, LIU Jia2, ZHENG Feng3, PAN Yicheng3
摘要: 聚类分析在机器学习、数据挖掘、生物DNA信息等方面都起着极为关键的作用。聚类算法从方法学上可分为扁平聚类和层谱聚类。扁平聚类通常将数据集分为K个并行社区,社区之间没有交集,但现实世界的社区之间多具有不同层次之间的包含关系,因而层谱聚类算法能对数据进行更精细的分析,提供更好的可解释性。而相比扁平聚类,层谱聚类研究进展缓慢。针对层谱聚类面临的问题,从对代价函数的选择、聚类结果衡量指标、聚类算法性能等方面入手,调研了大量的相关文献。其中聚类结果衡量指标主要有模块度、Jaccard 指数、标准化互信息、树状图纯度等。扁平聚类算法中比较经典的算法有K-means算法、标签传播算法、DBSCAN 算法、谱聚类算法等。层谱聚类算法可以进一步划分为分裂聚类算法和凝聚聚类算法,分裂层谱聚类算法有二分K-means算法和递归稀疏割算法,凝聚层谱聚类算法有经典的Louvain算法、BIRCH 算法和近年来提出的HLP 算法、PERCH算法及GRINCH算法。最后,进一步分析了这些算法的优缺点,并总结全文。
中图分类号:
[1]CHARIKAR M,CHATZIAFRATIS V.Approximate hierarchical clustering via sparsest cut and spreading metrics[C]//Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics,2017:841-854. [2]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]//KDD' 1996.1996:226-231. [3]NEWMAN M E J,GIRVAN M.Finding and evaluating community structure in networks[J].Physical Review E,2004,69(2):026113. [4]CILIBRASI R,VITÁNYI P M B.Clustering by compression[J].IEEE Transactions on Information Theory,2005,51(4):1523-1545. [5]JACCARD P.The distribution of the flora in the alpine zone.[J].New Phytologist,1912,11(2):37-50. [6]HELLER K A,GHAHRAMANI Z.Bayesian hierarchical clustering[C]//Proceedings of the 22nd International Conference on Machine learning.2005:297-304. [7]MANNING C D,SCHÜTZE H,RAGHAVAN P.Introductionto information retrieval[M].Cambridge:Cambridge University Press,2008. [8]DASGUPTA S.A cost function for similarity-based hierarchical clustering[C]//the 48th Annual ACM SIGACT Symposium.ACM,2016:118-127. [9]MACQUEEN J.Some methods for classification and analysis of multivariate observations[C]//Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability.1967:281-297. [10]HARTIGAN J A,WONG M A.Algorithm AS 136:A k-means clustering algorithm[J].Journal of the Royal Statistical Society,1979,28(1):100-108. [11]ACKERMANN M R,MÄRTENS M,RAUPACHC,et al.StreamKM++ A clustering algorithm for data streams[J].Journal of Experimental Algorithmics(JEA),2012,17:2.1-2.30. [12]SCULLEY D.Web-scale k-means clustering[C]//Proceedings of the 19th International Conference on World Wide Web.2010:1177-1178. [13]ZHU X,GHAHRAMANI Z.Learning from labels and unlabeled data with label propagation[J].Tech Report,2002,3175(2004):237-244. [14]REN Z W,SUN Q S,WEI D.Consensus Affinity Graph Lear-ning for Multiple Kernel Clustering[J].IEEE Transactions on Cybernetics,2021,51(6):3273-3284. [15]REN Z W,SUN Q S,WEI D.Multiple Kernel Clustering with Kernel k-Means Coupled Graph TensorLearning[C]//The Thirty-Fifth AAAI Conference on Artificial Intelligence.2021:9411-9418. [16]REN Z W,SUN Q S.Simultaneous Global and Local GraphStructure Preserving for Multiple Kernel Clustering[J].IEEE Transactions on Neural Networks and Learning Systems,2021,32(5):1839-1851. [17]TANG J,SUN J M,WANG C,et al.Social influence analysis in large-scale networks[C]//The 15th International Conference onKnowledge Discovery and Data Mining.2009:807-816. [18]TANG J,ZHANG J,YAO L M,et al.ArnetMiner:extraction and mining of academic social networks[C]//The 14th International Conference on Knowledge Discovery and Data Mining.e2008:990-998. [19]CHIANG W L,LIU X Q,SI S,et al.Cluster-gcn:An efficient algorithm for training deep and large graph convolutional networks[C]//Proceedings of the 25th ACM SIGKDD Inter-national Conference on Knowledge Discovery & Data Mining.2019:257-266. [20]WHITE S,SMYTH P.A spectral clustering approach to finding communities in graphs[C]//Proceedings of the 2005 SIAM International Conference on Data Mining.2005:274-285. [21]ZHANG L,ZHU Z G.Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks[C]//2019 International Confe-rence on 3D Vision(3DV).2019:395-404. [22]LECUNY.Modeles connexionistes de l'apprentissage[D].Pa-ris:Universite de Paris,1987. [23]BOURLARD H,KAMP Y.Auto-association by multilayer perceptrons and singular value decomposition [J].Biological Cybernetics,1998(59):291-294. [24]HINTON G E,ZEMEL R S.Autoencoders,Minimum Description Length and Helmholtz Free Energy[C]//Morgan Kaufmann.1993:3-10. [25]HINTON G E,JAMES L.McClelland Learning Representations by Recirculation[C]//NIPS.1987:358-366. [26]WANG X,JI H Y,SHI C.Heterogeneous graph attention network[C]//The World Wide Web Conference.2019:2022-2032. [27]CORBETTAM,SHULMAN G L.Spatial neglect and attention networks [J].Annual Review of Neuroscience,2011(34):569-599. [28]ZHANG K H,LI T P,SHEN S W.Adaptive graph convolutionalnetwork with attention graph clustering for co-saliency detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:9050-9059. [29]MUKHERJEE S,ASNANI H,LIN E,et al.Clustergan:Latent space clustering in generative adversarial networks[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:4610-4617. [30]CHEN X,DUAN Y,REIN H,et al.Infogan:Interpretable representation learning by information maximizing generative adversarial nets[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.2016:2180-2188. [31]LIU Z Y,WANG J H,LIANG Z W.Catgan:Category-awaregenerative adversarial networks with hierarchical evolutionary learning for category text generation[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2014:8425-8432. [32]STEINBACH M,KARYPIS G,KUMAR V.A comparison of document clustering techniques[C]//Text Mining Workshop at KDD 2000(May 2000).2000. [33]COHEN-ADDAD V,KANADE V,MALLMANN-TRENN F,et al.Hierarchical clustering:Objective functions and algorithms[C]//Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics,2018:378-397. [34]ROY A,POKUTTA S.Hierarchical clustering via spreadingmetrics[J].The Journal of Machine Learning Research,2017,18(1):3077-3111. [35]ARORA S,RAO S,VAZIRANI U.Expander Flows,Geometric Embeddings and Graph Partitioning[C]//Proceedings of the 36th Annual ACM(Association for Computing Machinery) Symposium on Theory of Computing.Computer Science Department,Princeton University,2004:222-231. [36]ROSSI R A,AHMED N K,KOH E,et al.Fast Hierarchical Graph Clustering in LinearTime[C]//Companion Proceedings of the Web Conference.2020:10-12. [37]BLONDEL V D,GUILLAUME J L,LAMBIOTTE R,et al.Fast unfolding of communities in large networks[J/OL].Journal of Statistical Mechanics Theory & Experiment,2008.https://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008. [38]NEWMAN M E J.Modularity and community structure in networks[C]//Proceedings of the National Academy of Sciences.2006:8577-8582. [39]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases[J].ACM Sigmod Record,1996,25(2):103-114. [40]KOBREN A,MONATH N,KRISHNAMURTHY A,et al.A hierarchical algorithm for extreme clustering[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Know-ledge Discovery and Data Mining.2017:255-264. [41]MONATH N,KOBREN A,KRISHNAMURTHY A,et al.Scalable Hierarchical Clustering with Tree Grafting[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2019:1438-1448. [42]SHANNON C E.A mathematical theory of communication[J].Bell System Technical Journal,1948,27(3):379-4232. [43]SHANNON C.The lattice theory of information[J].Transactions of the IRE professional Group on Information Theory,1953,1(1):105-107. [44]BROOKSJ F P.Three great challenges for half-century-old computer science[J].Journal of the ACM(JACM),2003,50(1):25-26. [45]LI A,PAN Y.Structural information and dynamical complexity of networks[J].IEEE Transactions on Information Theory,2016,62(6):3290-3339. [46]HUFFMAN D A.A method for the construction of minimum-redundancy codes[J].Proceedings of the IRE,1952,40(9):1098-1101. [47]LI A,YIN X,PAN Y.Three-dimensional gene map of cancer cell types:Structural entropy minimisation principle for defining tumour subtypes[J].Scientific Reports,2016,6(1):1-26. [48]GOODFELLOW I,BENGIO Y,COURVILLE A.Deep Lear-ning:Adaptive Computation and Machine Learning series[M].Massachusetts:The MIT Press,2016. [49]PARIS C,NEPAL S,JIN D,et al.A Comprehensive Survey on Community Detection with Deep Learning[J].arXiv:2105.12584,2021. [50]FICHTENBERGER H,GILLÉ M,SCHMIDT M,et al.BICO:BIRCH meets coresets for k-means clustering[C]//European Symposium on Algorithms.Berlin:Springer,2013:481-492. |
[1] | 王明, 武文芳, 王大玲, 冯时, 张一飞. 生成链接树:一种高数据真实性的反事实解释生成方法 Generative Link Tree:A Counterfactual Explanation Generation Approach with High Data Fidelity 计算机科学, 2022, 49(9): 33-40. https://doi.org/10.11896/jsjkx.220300158 |
[2] | 赵璐, 袁立明, 郝琨. 多示例学习算法综述 Review of Multi-instance Learning Algorithms 计算机科学, 2022, 49(6A): 93-99. https://doi.org/10.11896/jsjkx.210500047 |
[3] | 成科扬, 王宁, 崔宏纲, 詹永照. 基于局部注意力图互迁移的可解释性优化方法 Interpretability Optimization Method Based on Mutual Transfer of Local Attention Map 计算机科学, 2022, 49(5): 64-70. https://doi.org/10.11896/jsjkx.210400176 |
[4] | 陈之彧, 单志龙. 知识追踪研究进展 Research Advances in Knowledge Tracing 计算机科学, 2022, 49(10): 83-95. https://doi.org/10.11896/jsjkx.211000119 |
[5] | 朝乐门, 王锐. 数据科学平台:特征、技术及趋势 Data Science Platform:Features,Technologies and Trends 计算机科学, 2021, 48(8): 1-12. https://doi.org/10.11896/jsjkx.210600033 |
[6] | 张佳嘉, 张小洪. 多分支卷积神经网络肺结节分类方法及其可解释性 Multi-branch Convolutional Neural Network for Lung Nodule Classification and Its Interpretability 计算机科学, 2020, 47(9): 129-134. https://doi.org/10.11896/jsjkx.190700203 |
[7] | 向伟, 王新维. 基于多类邻域三支决策模型的不平衡数据分类 Imbalance Data Classification Based on Model of Multi-class Neighbourhood Three-way Decision 计算机科学, 2020, 47(5): 103-109. https://doi.org/10.11896/jsjkx.180601099 |
[8] | 张雷, 胡博文, 张宁, 王茂森. 图像超分辨率全局残差递归网络 Global Residual Recursive Network for Image Super-resolution 计算机科学, 2019, 46(6A): 230-233. |
[9] | 冯艳红,于红,孙庚,孙娟娟. 基于BLSTM的命名实体识别方法 Named Entity Recognition Method Based on BLSTM 计算机科学, 2018, 45(2): 261-268. https://doi.org/10.11896/j.issn.1002-137X.2018.02.045 |
[10] | 陈玉金,李续武. 直觉模糊数决策粗糙集 Intuitionistic Fuzzy Numbers Decision-theoretic Rough Sets 计算机科学, 2018, 45(2): 254-260. https://doi.org/10.11896/j.issn.1002-137X.2018.02.044 |
[11] | 魏霖静,练智超,王联国,侯振兴. 基于词条与语意差异度量的文档聚类算法 Term and Semantic Difference Metric Based Document Clustering Algorithm 计算机科学, 2016, 43(12): 229-233. https://doi.org/10.11896/j.issn.1002-137X.2016.12.042 |
[12] | . 新型前馈网络学习算法在语音识别中的应用 计算机科学, 2008, 35(8): 122-124. |
[13] | 杨娟 白云 邱玉辉. 一种异构实时系统中任务流均衡负载的算术模型 计算机科学, 2005, 32(7): 222-223. |
|