计算机科学 ›› 2025, Vol. 52 ›› Issue (2): 134-144.doi: 10.11896/jsjkx.240800040
张曼静1, 何玉林1,2, 李旭1, 黄哲学2
ZHANG Manjing1, HE Yulin1,2, LI Xu1, HUANG Zhexue2
摘要: 针对大数据聚类中存在的计算资源消耗大、聚类效率低的问题,提出了一种新的基于节点抽样的分布式二阶段聚类方法。该方法首先在各个本地节点对节点上的数据执行局部聚类操作,并基于局部聚类结果,从每个节点中抽取代表性的数据样本,然后将各节点选定的样本数据传输至中央节点。之后,在中央节点上,对合并的样本数据进行进一步的聚类分析,并将样本聚类的结果传回各个本地节点。最后,各本地节点结合自身的局部聚类结果和中央节点的样本聚类结果,完成最终的聚类标签统一。通过以上流程,所提方法实现了对集中式聚类算法的分布式改造,能够快速一致地完成对全局数据的聚类分析。理论分析和数值实验均表明,与传统的全量数据集中式聚类方法相比,二阶段聚类方法有效地结合了并行处理的高效性和集成分析的准确性,在保证聚类质量的前提下能够显著降低计算资源的消耗,是一种可行的大数据聚类分布式解决方案。
中图分类号:
[1]AHMED M,SERAJ R,ISLAM S M S.The k-means algorithm:A comprehensive survey and performance evaluation [J].Electronics,2020,9(8):1295. [2]IKOTUN A M,EZUGWU A E,ABUALIGAH L,et al.A comprehensive review of K-means clustering algorithms:Variants analysis and advances in big data era[J].Information Sciences,2023,622:178-210. [3]NIELSEN F,NIELSEN F.Hierarchical clustering [M]//Introduction to HPC with MPI for Data Science.Springer International Publishing,2016:195-211. [4]JACKSI K,IBRAHIM R K,ZEEBAREE S R M,et al.Clustering documents based on semantic similarity using HAC and K-means algorithms [C]//Proceedings of the 2020 International Conference on Advanced Science and Engineering(ICOASE).IEEE,2020:205-210. [5]RAMADHANI F,ZARLIS M,SUWILO S.Improvement of the BIRCH algorithm for big data clustering [C]//IOP Conference Series:Materials Science and Engineering.IOP Publishing,2020,725(1):012090. [6]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases [J].ACM Sigmod Record,1996,25(2):103-114. [7]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise [C]//Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(KDD'96).AAAI Press,1996:226-231. [8]DENG D.DBSCAN clustering algorithm based on density [C]//Proceedings of the 2020 7th International Forum on Electrical Engineering and Automation(IFEEA).IEEE,2020:949-953. [9]ANKERST M,BREUNIG M M,KRIEGEL H P,et al.OP-TICS:Ordering points to identify the clustering structure [J].ACM Sigmod Record,1999,28(2):49-60. [10]AGRAWAL R,GEHRKE J,GUNOPULOS D,et al.Automatic subspace clustering of high dimensional data for data mining applications [C]//Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data.ACM,1998:94-105. [11]WANG W,YANG J,MUNTZ R.STING:A statistical information grid approach to spatial data mining [C]//Proceedings of the 23rd International Conference on Very Large Data Bases(VLDB'97).Morgan Kaufmann Publishers Inc.,1997:186-195. [12]MOON T K.The expectation-maximization algorithm [J].IEEE Signal Processing Magazine,1996,13(6):47-60. [13]DO C B,BATZOGLOU S.What is the expectation maximization algorithm? [J].Nature Biotechnology,2008,26(8):897-899. [14]KAUFFMANN J,ESDERS M,RUFF L,et al.From clustering to cluster explanations via neural networks [J].IEEE Transactions on Neural Networks and Learning Systems,2022,35(2):1926-1940. [15]REN Y,PU J,YANG Z,et al.Deep clustering:A comprehensive survey [J].arXiv:2210.04142,2022. [16]MAHDI M A,HOSNY K M,ELHENAWY I.Scalable clustering algorithms for big data:A review [J].IEEE Access,2021,9:80015-80027. [17]OTHMAN S M,BA-ALWI F M,ALSOHYBE N T,et al.Intrusion detection model using machine learning algorithm on Big Data environment [J].Journal of Big Data,2018,5(1):1-12. [18]LI J,IZAKIAN H,PEDRYCZ W,et al.Clustering-based anoma-ly detection in multivariate time series data [J].Applied Soft Computing,2021,100:106919. [19]LI G,LIU F,SHARMA A,et al.Research on the natural language recognition method based oncluster analysis using neural network [J].Mathematical Problems in Engineering,2021,2021(1):9982305. [20]CHOWDHURY G G.Natural language processing [J].Annual Review of Information Science and Technology,2003,37(1):51-89. [21]SILVERSTEIN C,MARAIS H,HENZINGER M,et al.Analysis of a very large web search engine query log [C]//Procee-dings of the ACM SIGIR Forum.ACM,1999:6-12. [22]ZHAO S,ZHU L,WANG X,et al.Centerclip:Token clustering for efficient text-video retrieval [C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.2022:970-981. [23]MENG Q,ZHAO S,HUANG Z,et al.Magface:A universalrepresentation for face recognition and quality assessment [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:14225-14234. [24]YANG L,CHEN D,ZHAN X,et al.Learning to cluster faces via confidence and connectivity estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:13369-13378. [25]SAMBASIVAM S,THEODOSOPOULOS N.Advanced dataclustering methods of mining Web documents [J].Issues in Informing Science & Information Technology,2006,3:563-579. [26]SUN J G,LIU J,ZHAO L Y.Research on Clustering Algorithms [J].Journal of Software,2008,19(1):48-61. [27]OYEWOLE G J,THOPIL G A.Data clustering:application and trends [J].Artificial Intelligence Review,2023,56(7):6439-6475. [28]HASHEMI S E,GHOLIAN-JOUYBARI F,HAJIAGHAEI-KESHTELI M.A fuzzy C-means algorithm for optimizing data clustering [J].Expert Systems with Applications,2023,227:120377. [29]HUANG Z.Extensions to the k-means algorithm for clustering large data sets with categorical values [J].Data Mining and Knowledge Discovery,1998,2(3):283-304. [30]BAHMANI B,MOSELEY B,VATTANI A,et al.Scalable K-Means+ [J].Proceedings of the VLDB Endowment,2012,5(7):622-633. [31]MAO Y M,GAN D J,MWAKAPESA D S,et al.A MapReduce-based K-means clustering algorithm [J].The Journal of Supercomputing,2022,78(4):5181-5202. [32]SARDAR T H,ANSARI Z.Distributed big data clustering using MapReduce-based fuzzy C-medoids [J].Journal of The Institution of Engineers(India):Series B,2022,103(1):73-82. [33]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases [J].ACM Sigmod Record,1996,25(2):103-114. [34]RAMADHANI F,ZARLIS M,SUWILO S.Improvement of the BIRCH algorithm for big data clustering [C]//IOP Conference Series:Materials Science and Engineering.IOP Publishing,2020. [35]SCHUBERT E,SANDER J,ESTER M,et al.DBSCAN revisited,revisited:why and how you should(still) use DBSCAN [J].ACM Transactions on Database Systems(TODS),2017,42(3):1-21. [36]ANKERST M,BREUNIG M M,KRIEGEL H P,et al.OP-TICS:Ordering points to identify the clustering structure [J].ACM Sigmod Record,1999,28(2):49-60. [37]HINNEBURG A,KEIM D A.A general approach to clustering in large databases with noise [J].Knowledge and Information Systems,2003,5:387-415. [38]MU C,HOU Y,ZHAO J,et al.Stream-DBSCAN:A streaming distributed clustering model for water quality monitoring [J].Applied Sciences,2023,13(9):5408. [39]WU G,CAO L,TIAN H,et al.HY-DBSCAN:A hybrid parallel DBSCAN clustering algorithm scalable on distributed-memory computers [J].Journal of Parallel and Distributed Computing,2022,168:57-69. [40]XIE M,AREF W G.PD-DBSCAN:A density-based clusteringalgorithm for exploratory data analysis and data mining [J].Knowledge and Information Systems,2013,37(3):685-722. [41]KAUFFMANN J,ESDERS M,RUFF L,et al.From clustering to cluster explanations via neural networks [J].IEEE Transactions on Neural Networks and Learning Systems,2022,35(2):1926-1940. [42]LIU Y,TU W,ZHOU S,et al.Deep graph clustering via dual correlation reduction [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2022:7603-7611. [43]COCHRAN W G.Sampling Techniques [M].John Wiley & Sons,1977. [44]LECUN Y,CORTES C,BURGES C.The MNIST database of handwritten digits [J].Neural Computation,1998,10(5):1191-1232. [45]HETTICH S,KEGELMEYER W P.The UCI KDD Archive.KDD Cup 1999 Data[EB/OL].https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data. |
|