Computer Science ›› 2023, Vol. 50 ›› Issue (1): 9-17.doi: 10.11896/jsjkx.211000185

• Database & Big Data & Data Science • Previous Articles     Next Articles

Survey on Hierarchical Clustering for Machine Learning

WANG Shaojiang1, LIU Jia2, ZHENG Feng3, PAN Yicheng3   

  1. 1 North China Institute of Computing Technology,Beijing 100846,China
    2 Foundation Department of Air Force Engineering University,Xi'an 710051,China
    3 School of Computer Science and Engineering,Beihang University,Beijing 100191,China
  • Received:2021-10-25 Revised:2022-03-14 Online:2023-01-15 Published:2023-01-09
  • About author:WANG Shaojiang,born in 1979,Ph.D.His main research interests include graph machine learning and data mi-ning.
    PAN Yicheng,born in 1983,Ph.D.His main research interests include algorithms,network modeling,big data and information theory.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China(61807034).

Abstract: Clustering analysis plays a key role in machine learning,data mining and biological DNA information.Clustering algorithms can be categorized into flat clustering and hierarchical clustering.Flat clustering mostly divides the data set into K parallel communities without intersections,but the real communities have multi-level inclusion relations,so the hierarchical clustering algorithms can provide more elaborate analysis and better interpretability.Compared with flat clustering,the research progress of hierarchical clustering is slow.Aiming at the problem of hierarchical clustering,this paper surveys a large number of related papers in the aspects of the selection of cost functions,the evaluations of clustering results and the performance of clustering algorithms.The evaluation indices of clustering results mainly include modularity,Jaccard index,normalized mutual information,dendrogram purity,etc.Among the flat clustering algorithms,the classical algorithms include K-means algorithm,label propagation algorithm,DBSCAN algorithm,spectral algorithm and so on.The hierarchical clustering algorithms can be further classified into agglomerative clustering algorithms and split clustering algorithm.The split clustering algorithms involves dichotomy K-means algorithm,recursive sparsest cut algorithm,etc.Agglomerative clustering algorithms involves classical Louvain algorithm,BIRCH algorithm and more recent HLP algorithm,PERCH algorithm and GRINCH algorithm.This paper also further analyzes the advantages and disadvantages of these algorithms,and finally,the whole paper is summarized.

Key words: Hierarchical clustering, Cost function, Interpretability

CLC Number: 

  • TP181
[1]CHARIKAR M,CHATZIAFRATIS V.Approximate hierarchical clustering via sparsest cut and spreading metrics[C]//Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics,2017:841-854.
[2]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]//KDD' 1996.1996:226-231.
[3]NEWMAN M E J,GIRVAN M.Finding and evaluating community structure in networks[J].Physical Review E,2004,69(2):026113.
[4]CILIBRASI R,VITÁNYI P M B.Clustering by compression[J].IEEE Transactions on Information Theory,2005,51(4):1523-1545.
[5]JACCARD P.The distribution of the flora in the alpine zone.[J].New Phytologist,1912,11(2):37-50.
[6]HELLER K A,GHAHRAMANI Z.Bayesian hierarchical clustering[C]//Proceedings of the 22nd International Conference on Machine learning.2005:297-304.
[7]MANNING C D,SCHÜTZE H,RAGHAVAN P.Introductionto information retrieval[M].Cambridge:Cambridge University Press,2008.
[8]DASGUPTA S.A cost function for similarity-based hierarchical clustering[C]//the 48th Annual ACM SIGACT Symposium.ACM,2016:118-127.
[9]MACQUEEN J.Some methods for classification and analysis of multivariate observations[C]//Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability.1967:281-297.
[10]HARTIGAN J A,WONG M A.Algorithm AS 136:A k-means clustering algorithm[J].Journal of the Royal Statistical Society,1979,28(1):100-108.
[11]ACKERMANN M R,MÄRTENS M,RAUPACHC,et al.StreamKM++ A clustering algorithm for data streams[J].Journal of Experimental Algorithmics(JEA),2012,17:2.1-2.30.
[12]SCULLEY D.Web-scale k-means clustering[C]//Proceedings of the 19th International Conference on World Wide Web.2010:1177-1178.
[13]ZHU X,GHAHRAMANI Z.Learning from labels and unlabeled data with label propagation[J].Tech Report,2002,3175(2004):237-244.
[14]REN Z W,SUN Q S,WEI D.Consensus Affinity Graph Lear-ning for Multiple Kernel Clustering[J].IEEE Transactions on Cybernetics,2021,51(6):3273-3284.
[15]REN Z W,SUN Q S,WEI D.Multiple Kernel Clustering with Kernel k-Means Coupled Graph TensorLearning[C]//The Thirty-Fifth AAAI Conference on Artificial Intelligence.2021:9411-9418.
[16]REN Z W,SUN Q S.Simultaneous Global and Local GraphStructure Preserving for Multiple Kernel Clustering[J].IEEE Transactions on Neural Networks and Learning Systems,2021,32(5):1839-1851.
[17]TANG J,SUN J M,WANG C,et al.Social influence analysis in large-scale networks[C]//The 15th International Conference onKnowledge Discovery and Data Mining.2009:807-816.
[18]TANG J,ZHANG J,YAO L M,et al.ArnetMiner:extraction and mining of academic social networks[C]//The 14th International Conference on Knowledge Discovery and Data Mining.e2008:990-998.
[19]CHIANG W L,LIU X Q,SI S,et al.Cluster-gcn:An efficient algorithm for training deep and large graph convolutional networks[C]//Proceedings of the 25th ACM SIGKDD Inter-national Conference on Knowledge Discovery & Data Mining.2019:257-266.
[20]WHITE S,SMYTH P.A spectral clustering approach to finding communities in graphs[C]//Proceedings of the 2005 SIAM International Conference on Data Mining.2005:274-285.
[21]ZHANG L,ZHU Z G.Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks[C]//2019 International Confe-rence on 3D Vision(3DV).2019:395-404.
[22]LECUNY.Modeles connexionistes de l'apprentissage[D].Pa-ris:Universite de Paris,1987.
[23]BOURLARD H,KAMP Y.Auto-association by multilayer perceptrons and singular value decomposition [J].Biological Cybernetics,1998(59):291-294.
[24]HINTON G E,ZEMEL R S.Autoencoders,Minimum Description Length and Helmholtz Free Energy[C]//Morgan Kaufmann.1993:3-10.
[25]HINTON G E,JAMES L.McClelland Learning Representations by Recirculation[C]//NIPS.1987:358-366.
[26]WANG X,JI H Y,SHI C.Heterogeneous graph attention network[C]//The World Wide Web Conference.2019:2022-2032.
[27]CORBETTAM,SHULMAN G L.Spatial neglect and attention networks [J].Annual Review of Neuroscience,2011(34):569-599.
[28]ZHANG K H,LI T P,SHEN S W.Adaptive graph convolutionalnetwork with attention graph clustering for co-saliency detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:9050-9059.
[29]MUKHERJEE S,ASNANI H,LIN E,et al.Clustergan:Latent space clustering in generative adversarial networks[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:4610-4617.
[30]CHEN X,DUAN Y,REIN H,et al.Infogan:Interpretable representation learning by information maximizing generative adversarial nets[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.2016:2180-2188.
[31]LIU Z Y,WANG J H,LIANG Z W.Catgan:Category-awaregenerative adversarial networks with hierarchical evolutionary learning for category text generation[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2014:8425-8432.
[32]STEINBACH M,KARYPIS G,KUMAR V.A comparison of document clustering techniques[C]//Text Mining Workshop at KDD 2000(May 2000).2000.
[33]COHEN-ADDAD V,KANADE V,MALLMANN-TRENN F,et al.Hierarchical clustering:Objective functions and algorithms[C]//Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics,2018:378-397.
[34]ROY A,POKUTTA S.Hierarchical clustering via spreadingmetrics[J].The Journal of Machine Learning Research,2017,18(1):3077-3111.
[35]ARORA S,RAO S,VAZIRANI U.Expander Flows,Geometric Embeddings and Graph Partitioning[C]//Proceedings of the 36th Annual ACM(Association for Computing Machinery) Symposium on Theory of Computing.Computer Science Department,Princeton University,2004:222-231.
[36]ROSSI R A,AHMED N K,KOH E,et al.Fast Hierarchical Graph Clustering in LinearTime[C]//Companion Proceedings of the Web Conference.2020:10-12.
[37]BLONDEL V D,GUILLAUME J L,LAMBIOTTE R,et al.Fast unfolding of communities in large networks[J/OL].Journal of Statistical Mechanics Theory & Experiment,2008.https://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008.
[38]NEWMAN M E J.Modularity and community structure in networks[C]//Proceedings of the National Academy of Sciences.2006:8577-8582.
[39]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases[J].ACM Sigmod Record,1996,25(2):103-114.
[40]KOBREN A,MONATH N,KRISHNAMURTHY A,et al.A hierarchical algorithm for extreme clustering[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Know-ledge Discovery and Data Mining.2017:255-264.
[41]MONATH N,KOBREN A,KRISHNAMURTHY A,et al.Scalable Hierarchical Clustering with Tree Grafting[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2019:1438-1448.
[42]SHANNON C E.A mathematical theory of communication[J].Bell System Technical Journal,1948,27(3):379-4232.
[43]SHANNON C.The lattice theory of information[J].Transactions of the IRE professional Group on Information Theory,1953,1(1):105-107.
[44]BROOKSJ F P.Three great challenges for half-century-old computer science[J].Journal of the ACM(JACM),2003,50(1):25-26.
[45]LI A,PAN Y.Structural information and dynamical complexity of networks[J].IEEE Transactions on Information Theory,2016,62(6):3290-3339.
[46]HUFFMAN D A.A method for the construction of minimum-redundancy codes[J].Proceedings of the IRE,1952,40(9):1098-1101.
[47]LI A,YIN X,PAN Y.Three-dimensional gene map of cancer cell types:Structural entropy minimisation principle for defining tumour subtypes[J].Scientific Reports,2016,6(1):1-26.
[48]GOODFELLOW I,BENGIO Y,COURVILLE A.Deep Lear-ning:Adaptive Computation and Machine Learning series[M].Massachusetts:The MIT Press,2016.
[49]PARIS C,NEPAL S,JIN D,et al.A Comprehensive Survey on Community Detection with Deep Learning[J].arXiv:2105.12584,2021.
[50]FICHTENBERGER H,GILLÉ M,SCHMIDT M,et al.BICO:BIRCH meets coresets for k-means clustering[C]//European Symposium on Algorithms.Berlin:Springer,2013:481-492.
[1] WANG Ming, WU Wen-fang, WANG Da-ling, FENG Shi, ZHANG Yi-fei. Generative Link Tree:A Counterfactual Explanation Generation Approach with High Data Fidelity [J]. Computer Science, 2022, 49(9): 33-40.
[2] ZHAO Lu, YUAN Li-ming, HAO Kun. Review of Multi-instance Learning Algorithms [J]. Computer Science, 2022, 49(6A): 93-99.
[3] CHENG Ke-yang, WANG Ning, CUI Hong-gang, ZHAN Yong-zhao. Interpretability Optimization Method Based on Mutual Transfer of Local Attention Map [J]. Computer Science, 2022, 49(5): 64-70.
[4] WANG Mao-guang, JI Hao-yue, WANG Tian-ming. Study on Risk Control Model of Selective Ensemble Algorithm Based on Hierarchical Clustering and Simulated Annealing [J]. Computer Science, 2022, 49(11A): 210800105-7.
[5] CHEN Zhi-yu, SHAN Zhi-long. Research Advances in Knowledge Tracing [J]. Computer Science, 2022, 49(10): 83-95.
[6] CHEN Qing-chao, WANG Tao, FENG Wen-bo, YIN Shi-zhuang, LIU Li-jun. Unknown Binary Protocol Format Inference Method Based on Longest Continuous Interval [J]. Computer Science, 2020, 47(8): 313-318.
[7] LIU Jing, FANG Xian-wen. Mining Method of Business Process Change Based on Cost Alignment [J]. Computer Science, 2020, 47(7): 78-83.
[8] XIANG Wei, WANG Xin-wei. Imbalance Data Classification Based on Model of Multi-class Neighbourhood Three-way Decision [J]. Computer Science, 2020, 47(5): 103-109.
[9] ZHANG Yun-fan,ZHOU Yu,HUANG Zhi-qiu. Semantic Similarity Based API Usage Pattern Recommendation [J]. Computer Science, 2020, 47(3): 34-40.
[10] ZHANG Lei, HU Bo-wen, ZHANG Ning, WANG Mao-sen. Global Residual Recursive Network for Image Super-resolution [J]. Computer Science, 2019, 46(6A): 230-233.
[11] XIA Ying, LI Liu-jie, ZHANG XU, BAE Hae-young. Weighted Oversampling Method Based on Hierarchical Clustering for Unbalanced Data [J]. Computer Science, 2019, 46(4): 22-27.
[12] WU Yi-fan, CUI Yan-peng, HU Jian-wei. Alert Processing Method Based on Hierarchical Clustering [J]. Computer Science, 2019, 46(4): 203-209.
[13] FENG Yan-hong, YU Hong, SUN Geng and SUN Juan-juan. Named Entity Recognition Method Based on BLSTM [J]. Computer Science, 2018, 45(2): 261-268.
[14] WANG Shu-yi and DONG Dong. Mining of API Usage Pattern Based on Clustering and Partial Order Sequences [J]. Computer Science, 2017, 44(Z6): 486-490.
[15] LI Feng and XIE Si-hong. Study on Abnormal Diagnosis of Moving ECG Signals Based on Unsupervised Learning [J]. Computer Science, 2017, 44(Z11): 68-71.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!