Computer Science ›› 2025, Vol. 52 ›› Issue (2): 134-144.doi: 10.11896/jsjkx.240800040

• Database & Big Data & Data Science • Previous Articles     Next Articles

Distributed Two-stage Clustering Method Based on Node Sampling

ZHANG Manjing1, HE Yulin1,2, LI Xu1, HUANG Zhexue2   

  1. 1 Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ),Shenzhen,Guangdong 518107,China
    2 College of Computer Science & Software Engineering,Shenzhen University,Shenzhen,Guangdong 518060,China
  • Received:2024-08-06 Revised:2024-10-08 Online:2025-02-15 Published:2025-02-17
  • About author:ZHANG Manjing,born in 1992,Ph.D,research associate.Her main research interests include statistical analysis theo-ry and method for big data,design and application of big data machine learning algorithms.
    HE Yulin,born in 1982.Ph.D,research fellow,is a member of CCF(No.97303M).His main research interests include big data approximate computing technologies,multi-sample statistics theo-ries and methods,and data mining and machine algorithms and their applications.
  • Supported by:
    Science and Technology Major Project of Shenzhen(202302D074),Natural Science Foundation of Guangdong Province(2023A1515011667),Basic Research Foundation of Shenzhen(JCYJ20210324093609026) and Guangdong Basic and Applied Basic Research Foundation(2023B1515120020).

Abstract: To address the challenges of high computational resource consumption and low clustering efficiency in big data clustering scenarios,researchers have proposed an innovative distributed two-stage clustering method based on node sampling.This method first performs local clustering operations on the data at each local node,and then extracts representative data samples from each node based on the results of the local clustering.The selected sample data from each node is then transmitted to the central node.At the central node,further clustering analysis is conducted on the aggregated sample data,and the results of the sample clustering are transmitted back to the local nodes.Finally,each local node combines its own local clustering results with the sample clustering results to complete the final unification of clustering labels.Through this process,the two-stage clustering method has transformed the traditional centralized clustering algorithms into a more scalable,distributed model,and ensures the consistency of the clustering result to the global dataset.Theoretical analysis and experimental results both indicate that compared with the conventional full-data centralized clustering techniques,the two-stage clustering method offers a framework which effectively integrates the efficiency of parallel processing and the accuracy of integrated analysis.Without sacrificing the accuracy of clustering,it significantly improves clustering efficiency and reduces time costs,which provided a feasible and robust distributed solution tailored for the complexities inherent in big data clustering tasks.

Key words: Big data clustering, Distributed computing, Node sampling, Parallel computing, Two-stage clustering

CLC Number: 

  • TP391.4
[1]AHMED M,SERAJ R,ISLAM S M S.The k-means algorithm:A comprehensive survey and performance evaluation [J].Electronics,2020,9(8):1295.
[2]IKOTUN A M,EZUGWU A E,ABUALIGAH L,et al.A comprehensive review of K-means clustering algorithms:Variants analysis and advances in big data era[J].Information Sciences,2023,622:178-210.
[3]NIELSEN F,NIELSEN F.Hierarchical clustering [M]//Introduction to HPC with MPI for Data Science.Springer International Publishing,2016:195-211.
[4]JACKSI K,IBRAHIM R K,ZEEBAREE S R M,et al.Clustering documents based on semantic similarity using HAC and K-means algorithms [C]//Proceedings of the 2020 International Conference on Advanced Science and Engineering(ICOASE).IEEE,2020:205-210.
[5]RAMADHANI F,ZARLIS M,SUWILO S.Improvement of the BIRCH algorithm for big data clustering [C]//IOP Conference Series:Materials Science and Engineering.IOP Publishing,2020,725(1):012090.
[6]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases [J].ACM Sigmod Record,1996,25(2):103-114.
[7]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise [C]//Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(KDD'96).AAAI Press,1996:226-231.
[8]DENG D.DBSCAN clustering algorithm based on density [C]//Proceedings of the 2020 7th International Forum on Electrical Engineering and Automation(IFEEA).IEEE,2020:949-953.
[9]ANKERST M,BREUNIG M M,KRIEGEL H P,et al.OP-TICS:Ordering points to identify the clustering structure [J].ACM Sigmod Record,1999,28(2):49-60.
[10]AGRAWAL R,GEHRKE J,GUNOPULOS D,et al.Automatic subspace clustering of high dimensional data for data mining applications [C]//Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data.ACM,1998:94-105.
[11]WANG W,YANG J,MUNTZ R.STING:A statistical information grid approach to spatial data mining [C]//Proceedings of the 23rd International Conference on Very Large Data Bases(VLDB'97).Morgan Kaufmann Publishers Inc.,1997:186-195.
[12]MOON T K.The expectation-maximization algorithm [J].IEEE Signal Processing Magazine,1996,13(6):47-60.
[13]DO C B,BATZOGLOU S.What is the expectation maximization algorithm? [J].Nature Biotechnology,2008,26(8):897-899.
[14]KAUFFMANN J,ESDERS M,RUFF L,et al.From clustering to cluster explanations via neural networks [J].IEEE Transactions on Neural Networks and Learning Systems,2022,35(2):1926-1940.
[15]REN Y,PU J,YANG Z,et al.Deep clustering:A comprehensive survey [J].arXiv:2210.04142,2022.
[16]MAHDI M A,HOSNY K M,ELHENAWY I.Scalable clustering algorithms for big data:A review [J].IEEE Access,2021,9:80015-80027.
[17]OTHMAN S M,BA-ALWI F M,ALSOHYBE N T,et al.Intrusion detection model using machine learning algorithm on Big Data environment [J].Journal of Big Data,2018,5(1):1-12.
[18]LI J,IZAKIAN H,PEDRYCZ W,et al.Clustering-based anoma-ly detection in multivariate time series data [J].Applied Soft Computing,2021,100:106919.
[19]LI G,LIU F,SHARMA A,et al.Research on the natural language recognition method based oncluster analysis using neural network [J].Mathematical Problems in Engineering,2021,2021(1):9982305.
[20]CHOWDHURY G G.Natural language processing [J].Annual Review of Information Science and Technology,2003,37(1):51-89.
[21]SILVERSTEIN C,MARAIS H,HENZINGER M,et al.Analysis of a very large web search engine query log [C]//Procee-dings of the ACM SIGIR Forum.ACM,1999:6-12.
[22]ZHAO S,ZHU L,WANG X,et al.Centerclip:Token clustering for efficient text-video retrieval [C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.2022:970-981.
[23]MENG Q,ZHAO S,HUANG Z,et al.Magface:A universalrepresentation for face recognition and quality assessment [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:14225-14234.
[24]YANG L,CHEN D,ZHAN X,et al.Learning to cluster faces via confidence and connectivity estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:13369-13378.
[25]SAMBASIVAM S,THEODOSOPOULOS N.Advanced dataclustering methods of mining Web documents [J].Issues in Informing Science & Information Technology,2006,3:563-579.
[26]SUN J G,LIU J,ZHAO L Y.Research on Clustering Algorithms [J].Journal of Software,2008,19(1):48-61.
[27]OYEWOLE G J,THOPIL G A.Data clustering:application and trends [J].Artificial Intelligence Review,2023,56(7):6439-6475.
[28]HASHEMI S E,GHOLIAN-JOUYBARI F,HAJIAGHAEI-KESHTELI M.A fuzzy C-means algorithm for optimizing data clustering [J].Expert Systems with Applications,2023,227:120377.
[29]HUANG Z.Extensions to the k-means algorithm for clustering large data sets with categorical values [J].Data Mining and Knowledge Discovery,1998,2(3):283-304.
[30]BAHMANI B,MOSELEY B,VATTANI A,et al.Scalable K-Means+ [J].Proceedings of the VLDB Endowment,2012,5(7):622-633.
[31]MAO Y M,GAN D J,MWAKAPESA D S,et al.A MapReduce-based K-means clustering algorithm [J].The Journal of Supercomputing,2022,78(4):5181-5202.
[32]SARDAR T H,ANSARI Z.Distributed big data clustering using MapReduce-based fuzzy C-medoids [J].Journal of The Institution of Engineers(India):Series B,2022,103(1):73-82.
[33]ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:an efficient data clustering method for very large databases [J].ACM Sigmod Record,1996,25(2):103-114.
[34]RAMADHANI F,ZARLIS M,SUWILO S.Improvement of the BIRCH algorithm for big data clustering [C]//IOP Conference Series:Materials Science and Engineering.IOP Publishing,2020.
[35]SCHUBERT E,SANDER J,ESTER M,et al.DBSCAN revisited,revisited:why and how you should(still) use DBSCAN [J].ACM Transactions on Database Systems(TODS),2017,42(3):1-21.
[36]ANKERST M,BREUNIG M M,KRIEGEL H P,et al.OP-TICS:Ordering points to identify the clustering structure [J].ACM Sigmod Record,1999,28(2):49-60.
[37]HINNEBURG A,KEIM D A.A general approach to clustering in large databases with noise [J].Knowledge and Information Systems,2003,5:387-415.
[38]MU C,HOU Y,ZHAO J,et al.Stream-DBSCAN:A streaming distributed clustering model for water quality monitoring [J].Applied Sciences,2023,13(9):5408.
[39]WU G,CAO L,TIAN H,et al.HY-DBSCAN:A hybrid parallel DBSCAN clustering algorithm scalable on distributed-memory computers [J].Journal of Parallel and Distributed Computing,2022,168:57-69.
[40]XIE M,AREF W G.PD-DBSCAN:A density-based clusteringalgorithm for exploratory data analysis and data mining [J].Knowledge and Information Systems,2013,37(3):685-722.
[41]KAUFFMANN J,ESDERS M,RUFF L,et al.From clustering to cluster explanations via neural networks [J].IEEE Transactions on Neural Networks and Learning Systems,2022,35(2):1926-1940.
[42]LIU Y,TU W,ZHOU S,et al.Deep graph clustering via dual correlation reduction [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2022:7603-7611.
[43]COCHRAN W G.Sampling Techniques [M].John Wiley & Sons,1977.
[44]LECUN Y,CORTES C,BURGES C.The MNIST database of handwritten digits [J].Neural Computation,1998,10(5):1191-1232.
[45]HETTICH S,KEGELMEYER W P.The UCI KDD Archive.KDD Cup 1999 Data[EB/OL].https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data.
[1] LIAO Qiucheng, ZHOU Yang, LIN Xinhua. Metrics and Tools for Evaluating the Deviation in Parallel Timing [J]. Computer Science, 2025, 52(5): 41-49.
[2] HUANG Chenxi, LI Jiahui, YAN Hui, ZHONG Ying, LU Yutong. Investigation on Load Balancing Strategies for Lattice Boltzmann Method with Local Grid Refinement [J]. Computer Science, 2025, 52(5): 101-108.
[3] XU He, ZHOU Tao, LI Peng, QIN Fangfang, JI Yimu. LU Parallel Decomposition Optimization Algorithm Based on Kunpeng Processor [J]. Computer Science, 2024, 51(9): 51-58.
[4] WANG Hancheng, DAI Haipeng, CHEN Zhipeng, CHEN Shusen, CHEN Guihai. Large-scale Network Community Detection Algorithm Based on MapReduce [J]. Computer Science, 2024, 51(4): 11-18.
[5] ZHONG Zhenyu, LIN Yongliang, WANG Haotian, LI Dongwen, SUN Yufei, ZHANG Yuzhi. Automatic Pipeline Parallel Training Framework for General-purpose Computing Devices [J]. Computer Science, 2024, 51(12): 129-136.
[6] LI Siyao, LI Shanglin, LUO Jingzhi. Parallel Computing of Reentry Vehicle Trajectory by Multiple Shooting Method Based onOPENMP [J]. Computer Science, 2024, 51(11A): 231000019-6.
[7] PENG Weidong, GUO Wei, WEI Lin. Reconfigurable Computing System for Parallel Implementation of SVM Training Based on FPGA [J]. Computer Science, 2024, 51(11A): 231100120-7.
[8] WANG Xiaozhong, ZHANG Zuyu. Multi Level Parallel Computing for SW26010 Discontinuous Galerkin Finite Element Algorithm [J]. Computer Science, 2024, 51(11A): 240700055-5.
[9] HE Weilong, SU Lingli, GUO Bingxuan, LI Maosen, HAO Yan. Research and Implementation of Dynamic Scene 3D Perception Technology Based on BinocularEstimation [J]. Computer Science, 2024, 51(11A): 240300045-8.
[10] ZHAI Xulun, ZHANG Yongguang, JIN Anzhao, QIANG Wei, LI Mengbing. Parallel DVB-RCS2 Turbo Decoding on Multi-core CPU [J]. Computer Science, 2023, 50(6): 22-28.
[11] HAN Qiqi, LIU Xin. Application of Air-Sea Coupled Mode in High-speed Interconnection Environment [J]. Computer Science, 2023, 50(11A): 221000136-5.
[12] DING Yue, XU Chuanfu, QIU Haozhong, DAI Weixi, WANG Qingsong, LIN Yongzhen, WANG Zhenghua. Study on Cross-platform Heterogeneous Parallel Computing for Lattice Boltzmann Multi-phase Flow Simulations Based on SYCL [J]. Computer Science, 2023, 50(11): 32-40.
[13] FENG Chen, GU Jingjing. Efficient Distributed Training Framework for Federated Learning [J]. Computer Science, 2023, 50(11): 317-326.
[14] CHEN Xin, LI Fang, DING Hai-xin, SUN Wei-ze, LIU Xin, CHEN De-xun, YE Yue-jin, HE Xiang. Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture [J]. Computer Science, 2022, 49(6): 99-107.
[15] ZHU Ruo-chen, YANG Chang-chun, ZHANG Deng-hui. EGOS-DST:Efficient Schema-guided Approach to One-step Dialogue State Tracking for Diverse Expressions [J]. Computer Science, 2022, 49(11A): 210900246-7.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!