计算机科学 ›› 2026, Vol. 53 ›› Issue (4): 24-32.doi: 10.11896/jsjkx.251000037
高贵晨, 姜少峰
GAO Guichen, JIANG Shaofeng
摘要: 聚类是机器学习中的经典任务,旨在根据相似度度量将数据划分为若干簇。k-均值聚类作为最基本的聚类模型,自提出以来已被深入研究并在众多领域得到广泛应用。聚焦k-均值模型的求解问题,从理论计算机科学的视角出发,介绍k-均值的(接近)线性时间的快速近似算法的研究进展。此外,简要讨论其他相关大数据计算模型中的聚类算法的相关进展,包括动态、数据流与并行计算等计算模型。
中图分类号:
| [1]JÉGOU H,DOUZE M,SCHMID C.Product Quantization forNearest Neighbor Search[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2011,33(1):117-128. [2]PENNINGTON J,SOCHER R,MANNING C D.Glove:Global Vectors for Word Representation[C]//Proceedings of Conference on Empirical Methods in Natural Language Processing.ACL,2014:1532-1543. [3]BABENKO A,LEMPITSKY V S.Efficient Indexing of Billion-Scale Datasets of Deep Descriptors[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas:IEEE Computer Society,2016:2055-2063. [4]MAHAJAN M,NIMBHORKAR P,VARADARAJAN K R.The Planar k-Means Problem is NP-hard[J].Theoretical Computer Science,2012,442:13-21. [5]INABA M,KATOH N,IMAI H.Applications of WeightedVoronoi Diagrams and Randomization to Variance-Based k-Clustering[C]//Proceedings of Annual Symposium on Computational Geometry.New York:ACM,1994:332-339. [6]MATOUŠEK J.On Approximate Geometric k-Clustering[J].Discrete & Computational Geometry,2000,24(1):61-84. [7]DE LA VEGA W F,KARPINSKI M,KENYON C,et al.Approximation Schemes for Clustering Problems[C]//Procee-dings of Annual ACM Symposium on Theory of Computing.New York:ACM,2003:50-58. [8]HAR-PELED S,MAZUMDAR S.On Coresets for k-Means andk-Median Clustering[C]//Proceedings of Annual ACM Symposium on Theory of Computing.New York:ACM,2004:291-300. [9]FELDMAN D,MONEMIZADEH M,SOHLER C.A PTAS fork-Means Clustering Based on Weak Coresets[C]//Proceedings of ACM Symposium on Computational Geometry.New York:ACM,2007:11-18. [10]CHEN K.On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications[J].SIAM Journal on Computing,2009,39(3):923-947. [11]KUMAR A,SABHARWAL Y,SEN S.Linear-Time Approxi-mation Schemes for Clustering Problems in Any Dimensions[J].Journal of the ACM,2010,57(2):5:1-5:32. [12]COHEN-ADDAD V.A Fast Approximation Scheme for Low-Dimensional k-Means[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.New Orleans:SIAM,2018:430-440. [13]FRIGGSTAD Z,REZAPOUR M,SALAVATIPOUR M R.Local Search Yields a PTAS for k-Means in Doubling Metrics[J].SIAM Journal on Computing,2019,48(2):452-480. [14]COHEN-ADDAD V,KLEIN P N,MATHIEU C.Local Search Yields Approximation Schemes for k-Means and k-Median in Euclidean and Minor-Free Metrics[J].SIAM Journal on Computing,2019,48(2):644-667. [15]ABBASI F,BANERJEE S,BYRKA J,et al.ParameterizedApproximation Schemes for Clustering with General Norm Objectives[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE,2023:1377-1399. [16]COHEN-ADDAD V,FELDMANN A E,SAULPIC D.Near-Linear Time Approximation Schemes for Clustering in Doubling Metrics[J].Journal of the ACM,2021,68(6):44:1-44:34. [17]AWASTHI P,CHARIKAR M,KRISHNASWAMY R,et al.The Hardness of Approximation of Euclidean k-Means[C]//Proceedings of International Symposium on Computational Geometry.Eindhoven:Schloss Dagstuhl-Leibniz-Zentrum für Informatik,2015:754-767. [18]LLOYD S P.Least Squares Quantization in PCM[J].IEEETransactions on Information Theory,1982,28(2):129-136. [19]ARTHUR D,VASSILVITSKII S.k-Means++:the Advantages of Careful Seeding[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.New Orleans:SIAM,2007:1027-1035. [20]METTU R R,PLAXTON C G.Optimal Time Bounds for Approximate Clustering[J].Machine Learning,2004,56(1/3):35-60. [21]BACHEM O,LUCIC M,HASSANI S H,et al.Fast and Pro-vably Good Seedings for k-Means[C]//Proceeding of Annual Conference on Neural Information Processing Systems.Barcelona,2016:55-63. [22]LATTANZI S,SOHLER C.A Better k-Means++ Algorithm via Local Search[C]//Proceedings of International Conference on Machine Learning.Long Beach:PMLR,2019:3662-3671. [23]CHOO D,GRUNAU C,PORTMANN J,et al.k-Means++:Few More Steps Yield Constant Approximation[C]//Procee-dings of International Conference on Machine Learning.Virtual Event:PMLR,2020:1909-1917. [24]HUANG J,FENG Q,HUANG Z,et al.FLS:A New LocalSearch Algorithm for k-Means with Smaller Search Space[C]//Proceedings of International Joint Conference on Artificial Intelligence.Vienna:ijcai.org,2022:3092-3098. [25]BERETTA L,COHEN-ADDAD V,LATTANZI S,et al.Multi-Swap k-Means++[C]//Proceeding of Annual Conference on Neural Information Processing Systems.2023. [26]HUANG J,FENG Q,HUANG Z,et al.Linear Time Algorithms for k-Means with Multi-Swap Local Search[C]//Proceeding of Annual Conference on Neural Information Processing Systems.2023. [27]FAN C,LI P,LI X.LSDS++:Dual Sampling for Accelerated k-Means++[C]//Proceedings of International Conference on Machine Learning.PMLR,2023:9640-9649. [28]HUANG J,FENG Q,ZHANG Z,et al.Fast Local Search Algorithms for Clustering with Adaptive Sampling and Bandit Strategies[C]//Proceeding of Annual Conference on Neural Information Processing Systems.2025. [29]BECCHETTI L,BURY M,COHEN-ADDAD V,et al.Oblivious Dimension Reduction for k-Means:Beyond Subspaces and the Johnson-Lindenstrauss Lemma[C]//Proceedings of Annual ACM SIGACT Symposium on Theory of Computing.New York:ACM,2019:1039-1050. [30]BOUTSIDIS C,ZOUZIAS A,DRINEAS P.Random Projections for k-Means Clustering[C]//Proceedings of Annual Conference on Neural Information Processing Systems.Curran Associates Inc.,2010:298-306. [31]COHEN M B,ELDER S,MUSCO C,et al.Dimensionality Reduction for k-Means Clustering and Low Rank Approximation[C]//Proceedings of Annual ACM on Symposium on Theory of Computing.New York:ACM,2015:163-172. [32]MAKARYCHEV K,MAKARYCHEV Y,RAZENSHTEYN I.Performance of Johnson-Lindenstrauss Transform for k-Means and k-Medians Clustering[C]//Proceedings of Annual ACM SIGACT Symposium on Theory of Computing.New York:ACM,2019:1027-1038. [33]JOHNSON W B,LINDENSTRAUSS J.Extensions of Lipschitz Mappings into Hilbert Space[J].Contemporary mathematics,1984,26(1):189-206. [34]LANGBERG M,SCHULMAN L J.Universal Epsilon-Approximators for Integrals[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.SIAM,2010:598-607. [35]FELDMAN D,LANGBERG M.A Unified Framework for Approximating and Clustering Data[C]//Proceedings of ACM Symposium on Theory of Computing.New York:ACM,2011:569-578. [36]DRAGANOV A,SAULPIC D,SCHWIEGELSHOHN C.Set-tling Time vs.Accuracy Tradeoffs for Clustering Big Data[J].Proceedings of the ACM on Management of Data,2024,2(3):173. [37]HAR-PELED S,KUSHAL A.Smaller Coresets for k-Medianand k-Means Clustering[J].Discrete & Computational Geometry,2007,37(1):3-19. [38]BACHEM O,LUCIC M,KRAUSE A.Scalable k-Means Clustering via Lightweight Coresets[C]//Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.New York:ACM,2018:1119-1127. [39]HUANG L,JIANG S H,LI J,et al.Epsilon-Coresets for Cluste-ring(with Outliers) in Doubling Metrics[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE Computer Society,2018:814-825. [40]SOHLER C,WOODRUFF D P.Strong Coresets for k-Medianand Subspace Approximation:Goodbye Dimension[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE Computer Society,2018:802-813. [41]FELDMAN D,SCHMIDT M,SOHLER C.Turning Big Data into Tiny Data:Constant-Size Coresets for k-Means,PCA,and Projective Clustering[J].SIAM Journal on Computing,2020,49(3):601-657. [42]HUANG L,VISHNOI N K.Coresets for Clustering in Euclidean Spaces:Importance Sampling is Nearly Optimal[C]//Proceedings of Annual ACM SIGACT Symposium on Theory of Computing.New York:ACM,2020:1416-1429. [43]BRAVERMAN V,JIANG S H,KRAUTHGAMER R,et al.Coresets for Clustering in Excluded-Minor Graphs and Beyond[C]//Proceedings of ACM-SIAM Symposium on Discrete Algorithms.SIAM,2021:2679-2696. [44]COHEN-ADDAD V,SAULPIC D,SCHWIEGELSHOHN C.A New Coreset Framework for Clustering[C]//Proceedings of Annual ACM SIGACT Symposium on Theory of Computing.New York:ACM,2021:169-182. [45]COHEN-ADDAD V,LARSEN K G,SAULPIC D,et al.To-wards Optimal Lower Bounds for k-Median and k-Means Coresets[C]//Proceedings of Annual ACM SIGACT Symposium on Theory of Computing.New York:ACM,2022:1038-1051. [46]COHEN-ADDAD V,LARSEN K G,SAULPIC D,et al.Im-proved Coresets for Euclidean k-Means[C]//Proceedings of Annual Conference on Neural Information Processing Systems.2022. [47]HUANG L,LI J,WU X.On Optimal Coreset Construction forEuclidean (k,z)-Clustering[C]//Proceedings of Annual ACM Symposium on Theory of Computing.New York:ACM,2024:1594-1604. [48]BANSAL N,COHEN-ADDAD V,PRABHU M,et al.Sensitivity Sampling for k-Means:Worst Case and Stability Optimal Coreset Bounds[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE,2024:1707-1723. [49]COHEN-ADDAD V,DRAGANOV A,RUSSO M,et al.A Tight VC-Dimension Analysis of Clustering Coresets with Applications[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.SIAM,2025:4783-4808. [50]JAIN K,VAZIRANI V V.Approximation Algorithms for Me-tric Facility Location and k-Median Problems Using the Primal-Dual Schema and Lagrangian Relaxation[J].Journal of the ACM,2001,48(2):274-296. [51]KANUNGO T,MOUNT D M,NETANYAHU N S,et al.ALocal Search Approximation Algorithm for k-Means Clustering[J].Computational Geometry,2004,28(2/3):89-112. [52]GUPTA A,TANGWONGSAN K.Simpler Analyses of LocalSearch Algorithms for Facility Location[J].arXiv:0809.2554,2008. [53]AHMADIAN S,NOROUZI-FARD A,SVENSSON O,et al.Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE Computer Society,2017:61-72. [54]GRANDONI F,OSTROVSKY R,RABANI Y,et al.A Refined Approximation for Euclidean k-Means[J].Information Proces-sing Letters,2022,176:106251. [55]COHEN-ADDAD V,ESFANDIARI H,MIRROKNI V S,et al.Improved Approximations for Euclidean k-Means and k-Median,via Nested Quasi-Independent Sets[C]//Proceedings of Annual ACM SIGACT Symposium on Theory of Computing.New York:ACM,2022:1621-1628. [56]CHARIKAR M,COHEN-ADDAD V,GAO R,et al.An Im-proved Greedy Approximation for(Metric) k-Means[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE Computer Society,2025. [57]BADOIU M,CZUMAJ A,INDYK P,et al.Facility Location in Sublinear Time[C]//Proceedings of Automata,Languages and Programming,International Colloquium.Springer,2005:866-877. [58]INDYK P,MOTWANI R.Approximate Nearest Neighbors:Towards Removing the Curse of Dimensionality[C]//Proceedings of Annual ACM Symposium on the Theory of Computing.New York:ACM,1998:604-613. [59]HAR-PELED S,INDYK P,MOTWANI R.Approximate Nea-rest Neighbor:Towards Removing the Curse of Dimensionality[J].Theory of Computing,2012,8(1):321-350. [60]ANDONI A,INDYK P.Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions[C]//Proceedings of Annual IEEE Symposium on Foundations of Computer Science.IEEE Computer Society,2006:459-468. [61]ANDONI A,INDYK P.Near-Optimal Hashing Algorithms forApproximate Nearest Neighbor in High Dimensions[J].Communications of the ACM,2008,51(1):117-122. [62]ANDONI A,RAZENSHTEYN I P.Optimal Data-DependentHashing for Approximate Near Neighbors[C]//Proceedings of Annual ACM on Symposium on Theory of Computing.New York:ACM,2015:793-801. [63]GOEL A,INDYK P,VARADARAJAN K R.Reductions among High Dimensional Proximity Problems[C]//Proceedings of Annual Symposium on Discrete Algorithms.Washington:ACM/SIAM,2001:769-778. [64]JIANG S H,JIN Y,LOU J,et al.Local Search for Clustering in Almost-Linear Time[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.SIAM,2026. [65]HAR-PELED S,INDYK P,SIDIROPOULOS A.EuclideanSpanners in High Dimensions[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.SIAM,2013:804-809. [66]THORUP M.Quick k-Median,k-Center,and Facility Locationfor Sparse Graphs[J].SIAM Journal on Computing,2004,34(2):405-432. [67]ARYA V,GARG N,KHANDEKAR R,et al.Local SearchHeuristics for k-Median and Facility Location Problems[J].SIAM Journal on Computing,2004,33(3):544-562. [68]LA TOUR M D,SAULPIC D.Almost-Linear Time Approximation Algorithm to Euclidean k-Median and k-Means[J].arXiv:2407.11217,2024. [69]METTU R R,PLAXTON C G.The Online Median Problem[J].SIAM Journal on Computing,2003,32(3):816-832. [70]COHEN-ADDAD V,LATTANZI S,NOROUZI-FARD A,et al.Fast and Accurate k-Means++ via Rejection Sampling[C]//Proceedings of Annual Conference on Neural Information Processing Systems.2020. [71]CHARIKAR M,HENZINGER M,HU L,et al.Simple,Scalable and Effective Clustering via One-Dimensional Projections[C]//Proceedings of Annual Conference on Neural Information Processing Systems.2023. [72]BHATTACHARYA S,COSTA M,FAROKHNEJAD E,et al.Fully Dynamic Euclidean k-Means[J].arXiv:2507.11256,2025. [73]LATTANZI S,VASSILVITSKII S.Consistent k-Clustering[C]//Proceedings of International Conference on Machine Learning.PMLR,2017:1975-1984. [74]COHEN-ADDAD V,HJULER N,PAROTSIDIS N,et al.Fully Dynamic Consistent Facility Location[C]//Proceedings of Annual Conference on Neural Information Processing Systems.2019:3250-3260. [75]BHATTACHARYA S,COSTA M,LATTANZI S,et al.Fully Dynamic k-Clustering in O~(k) Update Time[C]//Proceedings of Annual Conference on Neural Information Processing Systems.2023. [76]LA TOUR M D,HENZINGER M,SAULPIC D.Fully Dynamic k-Means Coreset in Near-Optimal Update Time[C]//Procee-dings of Annual European Symposium on Algorithms.London:Schloss Dagstuhl-Leibniz-Zentrum für Informatik,2024:100:1-100:16. [77]BHATTACHARYA S,COSTA M,GARG N,et al.Fully Dynamic k-Clustering with Fast Update Time and Small Recourse[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE,2024:216-227. [78]BHATTACHARYA S,COSTA M,FAROKHNEJAD E.Fully Dynamic k-Median with Near-Optimal Update Time and Recourse[C]//Proceedings of Annual ACM Symposium on Theory of Computing.New York:ACM,2025:1166-1177. [79]BRAVERMAN V,FRAHLING G,LANG H,et al.Clustering High Dimensional Dynamic Data Streams[C]//Proceedings of International Conference on Machine Learning.PMLR,2017:576-585. [80]SONG Z,YANG L F,ZHONG P.Sensitivity Sampling over Dynamic Geometric Data Streams with Applications to k-Clustering[J].arXiv:1802.00459,2018. [81]ACKERMANN M R,LAMMERSEN C,MÄRTENS M,et al.Streamkm++:A Clustering Algorithms for Data Streams[C]//Proceedings of the Twelfth Workshop on Algorithm Engineering and Experiments.SIAM,2010:173-187. [82]FICHTENBERGER H,GILLÉ M,SCHMIDT M,et al.BICO:BIRCH Meets Coresets for k-Means Clustering[C]//Procee-dings of Annual European Symposium on Algorithms.Springer,2013:481-492. [83]AILON N,JAISWAL R,MONTELEONI C.Streaming k-Means Approximation[C]//Proceedings of Annual Conference on Neural Information Processing Systems.Curran Associates Inc.,2009:10-18. [84]ZHANG Y,TANGWONGSAN K,TIRTHAPURA S.Strea-ming k-Means Clustering with Fast Queries[C]//Proceedings of IEEE International Conference on Data Engineering.IEEE Computer Society,2017:449-460. [85]BRAVERMAN V,FELDMAN D,LANG H,et al.StreamingCoreset Constructions for M-Estimators[C]//Approximation,Randomization,and Combinatorial Optimization.Cambridge:Schloss Dagstuhl-Leibniz-Zentrum für Informatik,2019:62:1-62:15. [86]COHEN-ADDAD V,WOODRUFF D P,ZHOU S.StreamingEuclidean k-Median and k-Means with o(logn) Space[C]//Proceedings of IEEE Annual Symposium on Foundations of Computer Science.IEEE,2023:883-908. [87]BRAVERMAN V,LANG H,LEVIN K D,et al.ClusteringProblems on Sliding Windows[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.SIAM,2016:1374-1390. [88]EPASTO A,LATTANZI S,VASSILVITSKII S,et al.Submo-dular Optimization over Sliding Windows[C]//Proceedings of International Conference on World Wide Web.ACM,2017:421-430. [89]EPASTO A,MAHDIAN M,MIRROKNI V S,et al.Improved Sliding Window Algorithms for Clustering and Coverage via Bucketing-Based Sketches[C]//Proceedings of ACM-SIAM Symposium on Discrete Algorithms.SIAM,2022:3005-3042. [90]WOODRUFF D P,ZHONG P,ZHOU S.Near-Optimal k-Clustering in the Sliding Window Model[C]//Proceedings of Annual Conference on Neural Information Processing Systems.2023. [91]KARLOFF H J,SURI S,VASSILVITSKII S.A Model of Computation for Mapreduce[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.SIAM,2010:938-948. [92]DEAN J,GHEMAWAT S.Mapreduce:Simplified Data Proces-sing on Large Clusters[J].Communications of the ACM,2008,51(1):107-113. [93]WHITE T.Hadoop-The Definitive Guide:Storage and Analysis at Internet Scale(4th ed)[M].O’Reilly,2015. [94]ZAHARIA M,CHOWDHURY M,FRANKLIN M J,et al.Spark:Cluster Computing with Working Sets[C]//2nd USENIX Workshop on Hot Topics in Cloud Computing.Boston:USENIX Association,2010. [95]COHEN-ADDAD V,KUHN F,PARSAEIAN Z.An Efficient Massively Parallel Constant-Factor Approximation Algorithm for the k-Means Problem[C]//Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms.SIAM,2026. [96]BHASKARA A,WIJEWARDENA M.Distributed Clusteringvia LSH Based Data Partitioning[C]//Proceedings of International Conference on Machine Learning.PMLR,2018:569-578. [97]CZUMAJ A,GAO G,JIANG S H,et al.Fully-Scalable MPC Algorithms for Clustering in High Dimension[C]//International Colloquium on Automata,Languages,and Programming.Tallinn:Schloss Dagstuhl-Leibniz-Zentrum für Informatik,2024:50:1-50:20. |
|
||