Computer Science, 2025, Vol. 52, Issue (2): 134-144. doi: 10.11896/jsjkx.240800040

• Database & Big Data & Data Science •

  • Corresponding author: HE Yulin (heyulin@gml.ac.cn)
  • First author: ZHANG Manjing (zhangmanjing@gml.ac.cn)

Distributed Two-stage Clustering Method Based on Node Sampling

ZHANG Manjing1, HE Yulin1,2, LI Xu1, HUANG Zhexue2   

  1 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518107, China
    2 College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Received: 2024-08-06  Revised: 2024-10-08  Online: 2025-02-15  Published: 2025-02-17
  • About author: ZHANG Manjing, born in 1992, Ph.D., research associate. Her main research interests include statistical analysis theory and methods for big data, and the design and application of big data machine learning algorithms.
    HE Yulin, born in 1982, Ph.D., research fellow, is a member of CCF (No. 97303M). His main research interests include big data approximate computing technologies, multi-sample statistics theories and methods, and data mining and machine learning algorithms and their applications.
  • Supported by:
    Science and Technology Major Project of Shenzhen (202302D074), Natural Science Foundation of Guangdong Province (2023A1515011667), Basic Research Foundation of Shenzhen (JCYJ20210324093609026) and Guangdong Basic and Applied Basic Research Foundation (2023B1515120020).



Abstract: To address the high computational resource consumption and low clustering efficiency of big data clustering, a distributed two-stage clustering method based on node sampling is proposed. The method first performs local clustering on the data held at each local node and, based on the local clustering results, extracts representative data samples from each node. The selected samples are then transmitted to the central node, where further clustering analysis is conducted on the aggregated sample data, and the sample clustering results are sent back to the local nodes. Finally, each local node combines its own local clustering results with the sample clustering results to unify the final clustering labels. Through this process, the two-stage method transforms a traditional centralized clustering algorithm into a scalable distributed one while keeping the clustering result consistent over the global dataset. Theoretical analysis and experimental results both indicate that, compared with conventional full-data centralized clustering, the two-stage method effectively combines the efficiency of parallel processing with the accuracy of integrated analysis: without sacrificing clustering accuracy, it significantly improves clustering efficiency and reduces time cost, providing a feasible and robust distributed solution for big data clustering tasks.
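The workflow described in the abstract — local clustering on each node, per-cluster sampling, central clustering of the merged samples, and label unification — can be sketched as follows. This is a minimal single-process simulation, not the paper's implementation: it assumes plain Lloyd's k-means as the base clusterer, uniform random sampling within each local cluster, and nearest-centroid matching for label unification; the paper's actual sampling and unification strategies may differ.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > 0:
                centroids[j] = pts.mean(axis=0)
    return centroids, labels

def two_stage_cluster(node_data, k_local, k_global,
                      samples_per_cluster=20, seed=0):
    """Two-stage clustering over a list of per-node data partitions.

    node_data: list of (n_i, d) arrays, one per local node.
    Returns (per-node final label arrays, global centroids).
    """
    rng = np.random.default_rng(seed)
    local_results, samples = [], []
    # Stage 1: local clustering and representative sampling on each node.
    for i, X in enumerate(node_data):
        centroids, labels = kmeans(X, k_local, seed=seed + i)
        local_results.append((centroids, labels))
        for j in range(k_local):
            idx = np.flatnonzero(labels == j)
            if len(idx) == 0:
                continue
            take = rng.choice(idx, size=min(samples_per_cluster, len(idx)),
                              replace=False)
            samples.append(X[take])
    # Stage 2: central clustering over the merged samples.
    global_centroids, _ = kmeans(np.vstack(samples), k_global, seed=seed)
    # Label unification: map each local centroid to its nearest global
    # centroid, then relabel every local point through that mapping.
    final_labels = []
    for centroids, labels in local_results:
        d = np.linalg.norm(centroids[:, None, :] - global_centroids[None, :, :],
                           axis=2)
        mapping = d.argmin(axis=1)
        final_labels.append(mapping[labels])
    return final_labels, global_centroids
```

In a real deployment, stage 1 runs in parallel on the worker nodes and only the sampled points (not the raw partitions) travel to the central node, which is where the savings in communication and computation come from.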

Key words: Big data clustering, Distributed computing, Node sampling, Parallel computing, Two-stage clustering

CLC number: TP391.4