Computer Science ›› 2023, Vol. 50 ›› Issue (1): 9-17. doi: 10.11896/jsjkx.211000185

• Database & Big Data & Data Science •

Survey on Hierarchical Clustering for Machine Learning

WANG Shaojiang1, LIU Jia2, ZHENG Feng3, PAN Yicheng3

  1 North China Institute of Computing Technology, Beijing 100846
    2 Foundation Department, Air Force Engineering University, Xi'an 710051
    3 School of Computer Science and Engineering, Beihang University, Beijing 100191
  • Received: 2021-10-25 Revised: 2022-03-14 Online: 2023-01-15 Published: 2023-01-09
  • Corresponding author: PAN Yicheng (yichengp@buaa.edu.cn)
  • About author: wangshaojiang@iscas.ac.cn
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (61807034)

Survey on Hierarchical Clustering for Machine Learning

WANG Shaojiang1, LIU Jia2, ZHENG Feng3, PAN Yicheng3   

  1 North China Institute of Computing Technology, Beijing 100846, China
    2 Foundation Department, Air Force Engineering University, Xi'an 710051, China
    3 School of Computer Science and Engineering, Beihang University, Beijing 100191, China
  • Received: 2021-10-25 Revised: 2022-03-14 Online: 2023-01-15 Published: 2023-01-09
  • About author: WANG Shaojiang, born in 1979, Ph.D. His main research interests include graph machine learning and data mining.
    PAN Yicheng, born in 1983, Ph.D. His main research interests include algorithms, network modeling, big data and information theory.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China(61807034).

Abstract: Cluster analysis plays an extremely important role in machine learning, data mining, biological DNA information and other fields. Methodologically, clustering algorithms can be divided into flat clustering and hierarchical clustering. Flat clustering usually partitions a data set into K parallel communities with no intersection between them, whereas real-world communities mostly have inclusion relations across different levels, so hierarchical clustering algorithms can analyze data at a finer granularity and provide better interpretability. Compared with flat clustering, however, research on hierarchical clustering has progressed slowly. Focusing on the problems faced by hierarchical clustering, this survey reviews a large body of related literature from the perspectives of cost function selection, evaluation indices of clustering results and the performance of clustering algorithms. The main evaluation indices of clustering results include modularity, the Jaccard index, normalized mutual information and dendrogram purity. Classical flat clustering algorithms include the K-means algorithm, the label propagation algorithm, the DBSCAN algorithm and spectral clustering. Hierarchical clustering algorithms can be further divided into divisive and agglomerative algorithms: divisive hierarchical clustering algorithms include the bisecting K-means algorithm and the recursive sparsest-cut algorithm, while agglomerative hierarchical clustering algorithms include the classical Louvain and BIRCH algorithms as well as the recently proposed HLP, PERCH and GRINCH algorithms. Finally, the advantages and disadvantages of these algorithms are analyzed and the survey is summarized.
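To make the evaluation indices listed above concrete, the following is a minimal, self-contained Python sketch of the pair-counting Jaccard index between two flat clusterings; the toy label vectors in the usage example are illustrative assumptions, not data from any of the surveyed papers. Normalized mutual information compares the same pair of partitions, but through the information they share rather than through pair counts.

```python
# Minimal sketch: pair-counting Jaccard index between two flat clusterings.
# Cluster labels are arbitrary hashable ids; the index asks how consistently
# the two partitions group the same pairs of points (1.0 means identical).
from itertools import combinations

def jaccard_index(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1      # pair placed together by both clusterings
        elif same_a:
            only_a += 1    # pair placed together only by clustering A
        elif same_b:
            only_b += 1    # pair placed together only by clustering B
    denom = both + only_a + only_b
    return both / denom if denom else 1.0

if __name__ == "__main__":
    ground_truth = [0, 0, 0, 1, 1, 2]   # hypothetical reference partition
    predicted    = [1, 1, 0, 2, 2, 0]   # hypothetical algorithm output
    print(jaccard_index(ground_truth, predicted))  # 0.4
```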

Key words: Hierarchical clustering, Cost function, Interpretability

Abstract: Cluster analysis plays a key role in machine learning, data mining, biological DNA information and other fields. Clustering algorithms can be categorized into flat clustering and hierarchical clustering. Flat clustering usually partitions a data set into K parallel communities with no intersection between them, whereas real-world communities often have multi-level inclusion relations, so hierarchical clustering algorithms can provide a more elaborate analysis and better interpretability. Compared with flat clustering, however, research on hierarchical clustering has progressed slowly. Focusing on the problems faced by hierarchical clustering, this paper surveys a large number of related papers from the perspectives of cost function selection, evaluation of clustering results and performance of clustering algorithms. The evaluation indices of clustering results mainly include modularity, the Jaccard index, normalized mutual information, dendrogram purity, etc. Among flat clustering algorithms, the classical ones include the K-means algorithm, the label propagation algorithm, the DBSCAN algorithm, spectral clustering and so on. Hierarchical clustering algorithms can be further classified into divisive clustering algorithms and agglomerative clustering algorithms. Divisive clustering algorithms include the bisecting K-means algorithm and the recursive sparsest-cut algorithm, while agglomerative clustering algorithms include the classical Louvain and BIRCH algorithms as well as the more recent HLP, PERCH and GRINCH algorithms. This paper further analyzes the advantages and disadvantages of these algorithms and concludes with a summary.
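As a concrete illustration of the cost-function perspective discussed in the abstract, below is a minimal Python sketch of Dasgupta's similarity-based cost for a hierarchical clustering tree (reference [8]): each edge of a weighted similarity graph pays its weight times the number of leaves under the lowest common ancestor of its endpoints, so a good tree cuts weak edges near the root and keeps strongly similar points together until deep in the tree. The nested-tuple dendrogram and the toy similarity graph are assumptions of this sketch only, not a data structure prescribed by any algorithm in the survey.

```python
# Minimal sketch of Dasgupta's cost for similarity-based hierarchical clustering:
# cost(T) = sum over edges (i, j) of w(i, j) * |leaves under lca_T(i, j)|.

def leaves(tree):
    """Set of leaf ids under a nested-tuple tree node (a leaf is a bare id)."""
    if not isinstance(tree, tuple):
        return {tree}
    out = set()
    for child in tree:
        out |= leaves(child)
    return out

def lca_size(tree, i, j):
    """Number of leaves under the smallest subtree containing both i and j."""
    if isinstance(tree, tuple):
        for child in tree:
            s = leaves(child)
            if i in s and j in s:             # both endpoints fall in one child,
                return lca_size(child, i, j)  # so the LCA lies deeper
    return len(leaves(tree))                  # this node is the LCA

def dasgupta_cost(tree, weighted_edges):
    """weighted_edges: iterable of (i, j, w) similarity edges."""
    return sum(w * lca_size(tree, i, j) for i, j, w in weighted_edges)

if __name__ == "__main__":
    # Hypothetical toy graph: two natural clusters {a, b} and {c, d}
    # joined by one weak cross edge.
    edges = [("a", "b", 1.0), ("c", "d", 1.0), ("b", "c", 0.1)]
    good_tree = (("a", "b"), ("c", "d"))   # splits the clusters at the root
    bad_tree = ((("a", "c"), "b"), "d")    # mixes the clusters early
    print(dasgupta_cost(good_tree, edges))  # 1.0*2 + 1.0*2 + 0.1*4 = 4.4
    print(dasgupta_cost(bad_tree, edges))   # 7.3
```

In this toy example the tree that separates the two natural clusters at the root achieves the lower cost, which is exactly the behaviour the cost function is designed to reward.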

Key words: Hierarchical clustering, Cost function, Interpretability

CLC Number: 

  • TP181
[1]CHARIKAR M,CHATZIAFRATIS V.Approximate hierarchical clustering via sparsest cut and spreading metrics[C]//Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics,2017:841-854.
[2]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]//KDD'96.1996:226-231.
[3]NEWMAN M E J,GIRVAN M.Finding and evaluating community structure in networks[J].Physical Review E,2004,69(2):026113.
[4]CILIBRASI R,VITÁNYI P M B.Clustering by compression[J].IEEE Transactions on Information Theory,2005,51(4):1523-1545.
[5]JACCARD P.The distribution of the flora in the alpine zone[J].New Phytologist,1912,11(2):37-50.
[6]HELLER K A,GHAHRAMANI Z.Bayesian hierarchical clustering[C]//Proceedings of the 22nd International Conference on Machine learning.2005:297-304.
[7]MANNING C D,SCHÜTZE H,RAGHAVAN P.Introduction to information retrieval[M].Cambridge:Cambridge University Press,2008.
[8]DASGUPTA S.A cost function for similarity-based hierarchical clustering[C]//Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing.ACM,2016:118-127.
[9]MACQUEEN J.Some methods for classification and analysis of multivariate observations[C]//Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability.1967:281-297.
[10]HARTIGAN J A,WONG M A.Algorithm AS 136:A k-means clustering algorithm[J].Journal of the Royal Statistical Society,1979,28(1):100-108.
[11]ACKERMANN M R,MÄRTENS M,RAUPACH C,et al.StreamKM++:A clustering algorithm for data streams[J].Journal of Experimental Algorithmics(JEA),2012,17:2.1-2.30.
[12]SCULLEY D.Web-scale k-means clustering[C]//Proceedings of the 19th International Conference on World Wide Web.2010:1177-1178.
[13]ZHU X,GHAHRAMANI Z.Learning from labeled and unlabeled data with label propagation[J].Tech Report,2002,3175(2004):237-244.
[14]REN Z W,SUN Q S,WEI D.Consensus Affinity Graph Lear-ning for Multiple Kernel Clustering[J].IEEE Transactions on Cybernetics,2021,51(6):3273-3284.
[15]REN Z W,SUN Q S,WEI D.Multiple Kernel Clustering with Kernel k-Means Coupled Graph Tensor Learning[C]//The Thirty-Fifth AAAI Conference on Artificial Intelligence.2021:9411-9418.
[16]REN Z W,SUN Q S.Simultaneous Global and Local Graph Structure Preserving for Multiple Kernel Clustering[J].IEEE Transactions on Neural Networks and Learning Systems,2021,32(5):1839-1851.
[17]TANG J,SUN J M,WANG C,et al.Social influence analysis in large-scale networks[C]//The 15th International Conference on Knowledge Discovery and Data Mining.2009:807-816.
[18]TANG J,ZHANG J,YAO L M,et al.ArnetMiner:extraction and mining of academic social networks[C]//The 14th International Conference on Knowledge Discovery and Data Mining.2008:990-998.
[19]CHIANG W L,LIU X Q,SI S,et al.Cluster-gcn:An efficient algorithm for training deep and large graph convolutional networks[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2019:257-266.
[20]WHITE S,SMYTH P.A spectral clustering approach to finding communities in graphs[C]//Proceedings of the 2005 SIAM International Conference on Data Mining.2005:274-285.
[21]ZHANG L,ZHU Z G.Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks[C]//2019 International Conference on 3D Vision(3DV).2019:395-404.
[22]LECUN Y.Modeles connexionistes de l'apprentissage[D].Paris:Universite de Paris,1987.
[23]BOURLARD H,KAMP Y.Auto-association by multilayer perceptrons and singular value decomposition[J].Biological Cybernetics,1988,59:291-294.
[24]HINTON G E,ZEMEL R S.Autoencoders,Minimum Description Length and Helmholtz Free Energy[C]//Advances in Neural Information Processing Systems.Morgan Kaufmann,1993:3-10.
[25]HINTON G E,MCCLELLAND J L.Learning Representations by Recirculation[C]//NIPS.1987:358-366.
[26]WANG X,JI H Y,SHI C.Heterogeneous graph attention network[C]//The World Wide Web Conference.2019:2022-2032.
[27]CORBETTA M,SHULMAN G L.Spatial neglect and attention networks[J].Annual Review of Neuroscience,2011(34):569-599.
[28]ZHANG K H,LI T P,SHEN S W.Adaptive graph convolutional network with attention graph clustering for co-saliency detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:9050-9059.
[29]MUKHERJEE S,ASNANI H,LIN E,et al.Clustergan:Latent space clustering in generative adversarial networks[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:4610-4617.
[30]CHEN X,DUAN Y,HOUTHOOFT R,et al.Infogan:Interpretable representation learning by information maximizing generative adversarial nets[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.2016:2180-2188.
[31]LIU Z Y,WANG J H,LIANG Z W.Catgan:Category-aware generative adversarial networks with hierarchical evolutionary learning for category text generation[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:8425-8432.
[32]STEINBACH M,KARYPIS G,KUMAR V.A comparison of document clustering techniques[C]//Text Mining Workshop at KDD 2000(May 2000).2000.
[33]COHEN-ADDAD V,KANADE V,MALLMANN-TRENN F,et al.Hierarchical clustering:Objective functions and algorithms[C]//Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics,2018:378-397.
[34]ROY A,POKUTTA S.Hierarchical clustering via spreading metrics[J].The Journal of Machine Learning Research,2017,18(1):3077-3111.
[35]ARORA S,RAO S,VAZIRANI U.Expander Flows,Geometric Embeddings and Graph Partitioning[C]//Proceedings of the 36th Annual ACM Symposium on Theory of Computing.2004:222-231.
[36]ROSSI R A,AHMED N K,KOH E,et al.Fast Hierarchical Graph Clustering in Linear Time[C]//Companion Proceedings of the Web Conference.2020:10-12.
[37]BLONDEL V D,GUILLAUME J L,LAMBIOTTE R,et al.Fast unfolding of communities in large networks[J/OL].Journal of Statistical Mechanics Theory & Experiment,2008.https://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008.
[38]NEWMAN M E J.Modularity and community structure in networks[J].Proceedings of the National Academy of Sciences,2006,103(23):8577-8582.
[39]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases[J].ACM Sigmod Record,1996,25(2):103-114.
[40]KOBREN A,MONATH N,KRISHNAMURTHY A,et al.A hierarchical algorithm for extreme clustering[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2017:255-264.
[41]MONATH N,KOBREN A,KRISHNAMURTHY A,et al.Scalable Hierarchical Clustering with Tree Grafting[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2019:1438-1448.
[42]SHANNON C E.A mathematical theory of communication[J].Bell System Technical Journal,1948,27(3):379-423.
[43]SHANNON C.The lattice theory of information[J].Transactions of the IRE Professional Group on Information Theory,1953,1(1):105-107.
[44]BROOKS F P.Three great challenges for half-century-old computer science[J].Journal of the ACM(JACM),2003,50(1):25-26.
[45]LI A,PAN Y.Structural information and dynamical complexity of networks[J].IEEE Transactions on Information Theory,2016,62(6):3290-3339.
[46]HUFFMAN D A.A method for the construction of minimum-redundancy codes[J].Proceedings of the IRE,1952,40(9):1098-1101.
[47]LI A,YIN X,PAN Y.Three-dimensional gene map of cancer cell types:Structural entropy minimisation principle for defining tumour subtypes[J].Scientific Reports,2016,6(1):1-26.
[48]GOODFELLOW I,BENGIO Y,COURVILLE A.Deep Learning:Adaptive Computation and Machine Learning series[M].Massachusetts:The MIT Press,2016.
[49]PARIS C,NEPAL S,JIN D,et al.A Comprehensive Survey on Community Detection with Deep Learning[J].arXiv:2105.12584,2021.
[50]FICHTENBERGER H,GILLÉ M,SCHMIDT M,et al.BICO:BIRCH meets coresets for k-means clustering[C]//European Symposium on Algorithms.Berlin:Springer,2013:481-492.