计算机科学 ›› 2019, Vol. 46 ›› Issue (9): 15-21.doi: 10.11896/j.issn.1002-137X.2019.09.002
秦悦1, 丁世飞1,2
QIN Yue1, DING Shi-fei1,2
摘要: 半监督聚类是结合半监督学习与聚类分析而提出的新的学习方法,其在机器学习中得到了广泛的重视和应用。传统无监督聚类算法在划分数据时并不需要任何数据属性,但在实际应用中,存在少量带有独立类标签或成对约束的监督信息的数据样本,学者们致力于将这些为数不多的监督信息运用于聚类,以得到更优的聚类结果,从而提出了半监督聚类。文中主要介绍了半监督聚类的理论基础和算法思想,并对半监督聚类的最新研究进展进行了综述。首先,对半监督学习的研究现状和分类进行了概述,并将生成式半监督学习、半监督SVM、基于图的半监督学习和协同训练这4种分类方法进行了对比;其次,针对半监督学习的聚类进行了详细的描述,并对4种典型半监督聚类算法(Cop-Kmeans算法、LCop-Kmeans算法、Seeded-Kmeans算法和SC-Kmeans算法)的算法思想进行了分析和总结,同时对这4种算法的优缺点进行了评价;然后,按照基于约束的半监督聚类和基于距离的半监督聚类两种情况,分别对半监督聚类的研究现状进行了阐述;最后,探讨了半监督聚类在生物信息学、图像分割以及计算机其他领域内的应用以及未来的研究方向。文中旨在使初学者能够快速了解半监督聚类的进展,理解典型的算法思想,并在之后的实际应用中能起到一定的指导作用。
中图分类号:
[1]HARTIGAN J A,WONG M A.Algorithm AS 136:A k-means clustering algorithm[J].Applied Statistics,1979,28(1):100-108. [2]MADDAH M,CRIMSON W E L,WARFIELD S K.Statistical modeling and EM clustering of white matter fiber tracts[C]//3rd IEEE International Symposium on Biomedical Imaging:Nano to Macro.New York:IEEE Press,2006:53-56. [3]LI K L,CAO Z,CAO L P,et al.Some Developments on Semi-Supervised Clustering [J].Pattern Recognition and Artificial Intelligence,2009,22(5):735-742.(in Chinese)李昆仑,曹铮,曹丽苹,等.半监督聚类的若干新进展[J].模式识别与人工智能,2009,22(5):735-742. [4]XIONG J B,LI Z K,LIU Y J.Research on the Present Situation of Semi-Supervised Clustering Algorithm[J].Modern Compu-ter,2009(12):61-64,77.(in Chinese)熊建斌,李振坤,刘怡俊.半监督聚类算法研究现状[J].现代计算机(专业版),2009(12):61-64,77. [5]LIU J W,LIU Y,LUO X L.Semi-Supervised Learning Methods[J].Chinese Journal of Computers,2015,38(8):1592-1617.(in Chinese)刘建伟,刘媛,罗雄麟.半监督学习方法[J].计算机学报,2015,38(8):1592-1617. [6]SCUDDER H I.Probability of error of some adaptive pattern-recognition machines[J].Information Theory IEEE Transactions on,1965,11(3):363-371. [7]FRALICK S.Learning to recognize patterns without a teacher[J].IEEE Transactions on Information Theory,2003,13(1):57-64. [8]AGRAWALA A.Learning with a probabilistic teacher[J].IEEE Transactions on Information Theory,1970,16(4):373-379. [9]MERZ C J,CLAIR D C,BOND W E.Semi-supervised adaptive resonance theory (SMART2)[C]//International Joint Confe-rence on Neural Networks.Baltimore:IEEE Press,1992:851-856. [10]SHAHSHAHANI B M,LANDGREBE D.The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon[J].IEEE Transactions on Geoscience & Remote Sensing,1994,32(5):1087-1095. [11]KINGMA D P,REZENDE D J,MOHAME D S.Semi-Supervised Learning with Deep Generative Models[J].Advances in Neural Information Processing Systems,2014,4:3581-3589. [12]KLEIN D,KAMVAR S D,MANNING C D.From Instance-le-vel Constraints to Space-Level Constraints:Making the Most of Prior Knowledge in Data Clustering[C]//Proceedings of the Nineteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:307-314. [13]CHENG S,SHI Y,QIN Q.Particle swarm optimization based semi-supervised learning on Chinese text categorization//IEEE Congress on Evolutionary Computation.New York:IEEE Press,2012:1-8. [14]WANG J,KUMAR S,CHANG S F.Semi-supervised hashing for scalable image retrieval[C]//IEEE Conference on Computer Vision and Pattern Recognition.San Francisco:DBLP,2010:3424-3431. [15]CHEN S G,ZHANG D Q.Experimental Comparisons of Semi-Supervised Dimensional Reduction Methods [J].Journal of software,2011,22(1):28-43.(in Chinese)陈诗国,张道强.半监督降维方法的实验比较[J].软件学报,2011,22(1):28-43. [16]ZHOU Z H,LI M.Semi-supervised regression with co-training[C]//International Joint Conference on Artificial Intelligence.San Francisco:Morgan Kaufmann Publishers Inc,2005:908-913. [17]MEHRKANOON S,ALZATE C,MALL R,et al.MulticlassSemi-supervised Learning Based Upon Kernel Spectral Clustering[J].IEEE Transactions on Neural Networks & Learning Systems,2015,26(4):720. [18]CALLUT J,FRANCOISSE K,SAERENS M,et al.Semi-supervised Classification from Discriminative Random Walks//ECML PKDD 2008.Berlin:Springer,2008:162-177. [19]周志华.Machine learning.北京:清华大学出版社,2016. [20]COZMAN F G,COHEN I.Unlabeled Data Can Degrade Classification Performance of Generative Classifiers[C]//Fifteenth International Florida Artificial Intelligence Society Conference.California:AAAI Press,2009:327-331. [21]CASTELLI V,COVER T M.On the exponential value of la-beled samples.Elsevier Science Inc,1995. [22]VAPNIK V,STERIN A.On structural risk minimization or overall risk in a problem of pattern recognition[J].Automation &Remote Control,1977,10(10):1495-1503. [23]WANG X,YU H.How to Break MD5 and Other Hash Functions[M]//Advances in Cryptology-EUROCRYPT 2005.DBLP,2005:19-35. [24]BLUM A,CHAWLA S.Learning from Labeled and Unlabeled Data using Graph Min-cuts[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:19-26. [25]BELKIN M,NIYOGE P,SINDHWANI V.Manifold Regularization:A Geometric Framework for Learning from Labeled and Unlabeled Examples[J].Journal of Machine Learning Research,2006,7(1):2399-2434. [26]BLUM A,MITCHELL T.Combining labeled and unlabeled data with co-training[C]//Proceedings of the 11th Annual Confe-rence on Computational Learning Theory (COLT98).Wisconsin,Ml,1998:92-100. [27]COLDMAN S,ZHOU Y.Enhancing supervised learning withunlabeled data[C]//Proceedings of the 17th International Conference on Machine Learning(ICML’00).San Francisco:CA,2000:327-334. [28]WAGSTAFF K,CARDIE C.Clustering with instance-level constraints[C]//Proceedings of 17th International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2000:1097-1103. [29]WAGSTAFF K,CARDIE C,ROGERS S,et al.Constrained K-means Clustering with Background Knowledge[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:577-584. [30]YANG Y,TAN W,LI T.Consensus clustering based on con-strained self-organizing map and improved Cop-Kmeans ensemble in intelligent decision support systems[J].Knowledge Based Systems,2012,32:101-115. [31]BASU S,BANERJEE A,MOONEY R.Semi-Supervised Clustering by Seeding[C]//Proceedings of 19th InternationalConfe-rence on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:19-26. [32]CHEN Z Y,WANG H J,HU M,et al.An active semi-super-vised clustering algorithm based on seeds set and pairwise constraints[J].Journal of Jilin University(Science Edition),2017,55(3):664-672.(in Chinese)陈志雨,王慧君,胡明,等.一种基于Seeds集和成对约束的主动半监督聚类算法[J].吉林大学学报(理学版),2017,55(3):664-672. [33]ZHENG L,LI T.Semi-supervised Hierarchical Clustering[C]//International Conference on Data Mining.2011. [34]ZHU Y,QIAN J H,JI Z B.An improved COP-Kmeans algo-rithm based on BFS.Beijing:China science and technology paper online .http://www.paper.edu.cn/releasepaper/content/201507-93.(in Chinese)朱煜,钱景辉,季正波.改进的基于广度优先搜索的COP-Kmeans算法.北京:中国科技论文在线 .http://www.paper.edu.cn/releasepaper/content/201507-93. [35] HE P,XU X H,LU L,et al.Semi-Supervised Clustering viaTwo-Level Random Walk[J].Journal of Software,2014,25(5):997-1013.(in Chinese)何萍,徐晓华,陆林,等.双层随机游走半监督聚类[J].软件学报,2014,25(5):997-1013. [36]TANG Q,LIAO Z G.A Semi-Supervised Clustering MethodBased on Affinity Propagation Algorithm[J].Electronic Information Warfare Technology,2017,32(1):8-12.(in Chinese)汤琼,廖泽广.一种基于AP算法的半监督聚类方法[J].电子信息对抗技术,2017,32(1):8-12. [37]YANG Y,RUTAYISIRE T,LIN C,et al.An Improved Cop-Kmeans Clustering for Solving Constraint Violation Based on MapReduce Framework[J].Fundamental Information,2013,126(4):301-318. [38]LI C M,XU S B,HAO Z F.Cross-Entropy semi-supervisedclustering based on pairwise constraints[J].Pattern Recognition and Artificial Intelligence,2017,30(7):598-608.(in Chinese)李晁铭,徐圣兵,郝志峰.基于成对约束的交叉熵半监督聚类算法[J].模式识别与人工智能,2017,30(7):598-608. [39]CHAI B F,LV F,LI W B,et al.Semi-supervised Kmeans Clustering Algorithm based on Active Learning Priors.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html.(in Chinese)柴变芳,吕峰,李文斌,等.基于主动学习先验的半监督Kmeans聚类算法.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html. [40]BASU S,BILENKO M,MOONEY R J.A probabilistic framework for semi-supervised clustering[C]//Proceedings of the Tenth ACM 0SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-04).New York:MIT Press,2008:59-68. [41]DING S,JIA H,DU M,et al.A semi-supervised approximate spectral clustering algorithm based on HMRF model[J].Information Sciences,2018,429:215-228. [42]ALOK A K,SAHA S,EKBAL A.Feature Selection and Semi-supervised Clustering Using Multi-objective Optimization[J].Springer Plus,2014,3(1):465. [43]WEI S,LI Z,ZHANG C.Combined constraint-based with metric-based in semi-supervised clustering ensemble[J].International Journal of Machine Learning & Cybernetics,2018,9(7):1085-1100. [44]SAHA S,KAUSHIK K,ALOK A K,et al.Multi-objective semi-supervised clustering of tissue samples for cancer diagnosis[J].Soft Computing,2016,20(9):3381-3392. [45]CHEN H S.Semi-supervised clustering ensemble for bio-mole-cular pattern mining.Guangzhou:South China University of Technology,2016.(in Chinese)陈弘晟.半监督聚类集成在生物分子模式挖掘中的应用.广州:华南理工大学,2016. [46]OROZCO-DUQUE A,BUSTAMANTE J,CASTELLANOS-DOMINGUEZ G.Semi-supervised clustering of fractionated electrograms for electroanatomical atrial mapping[J].Biomedical Engineering Online,2016,15(1):44. [47]AN Q Q,ZHANG F,LI Z X,et al.Research on image segmentation based on machine learning[J].Automation & Instrumentation,2018(6):29-31.(in Chinese)安强强,张峰,李赵兴,等.基于机器学习的图像分割研究[J].自动化与仪器仪表,2018(6):29-31. [48]LI Q L.Semi-supervised clustering based on constraints for images segmentation.Xi’an:XiDian University,2014.(in Chinese)李巧兰.基于约束的半监督聚类的图像分割算法研究 .西安:西安电子科技大学,2014. [49]LI Y W.Research on robust segmentation algorithm based onsemi-supervised fuzzy clustering.Xi’an:Xi’an University of Posts & Telecommunications,2018.(in Chinese)李亚文.鲁棒半监督模糊聚类分割算法研究.西安:西安邮电大学,2018. [50]ALOK A K,SAHA S,EKBAL A.Multi-objective semi-super-vised clustering for automatic pixel classification from remote sensing imagery[J].Soft Computing,2016,20(12):4733-4751. [51]FIORE U,PALMIERI F,CASTIGLIONE A,et al.Network anomaly detection with the restricted Boltzmann machine[J].Neuro Computing,2013,122:13-23. [52]LIANG C,LI C H.Novel Intrusion Detection Method Based on Semi-supervised Clustering[J].Computer Science,2016,43(5):87-90.(in Chinese)梁辰,李成海.一种新的半监督入侵检测方法[J].计算机科学,2016,43(5):87-90. [53]PENG T L,ZHANG W J,LAN J L,et al.Micro video annotation method based on semi-supervised clustering[J].Application Research of Computers,2016,33(3):948-952.(in Chinese)彭太乐,张文俊,蓝建梁,等.基于半监督聚类的微视频标注方法[J].计算机应用研究,2016,33(3):948-952. [54]ZHONG S.Semi-supervised model-based document clustering:A comparative study[J].Machine Learning,2006,65(1):3-29. [55]CHENG X M,YANG Q H,ZHAI Y P,et al.Test Case Selection Technique Base on Semi-supervised Clustering Method[J].Computer Science,2018,45(1):249-254.(in Chinese)程雪梅,杨秋辉,翟宇鹏,等.基于半监督聚类方法的测试用例选择技术[J].计算机科学,2018,45(1):249-254. |
[1] | 柴慧敏, 张勇, 方敏. 基于特征相似度聚类的空中目标分群方法 Aerial Target Grouping Method Based on Feature Similarity Clustering 计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203 |
[2] | 郑文萍, 刘美麟, 杨贵. 一种基于节点稳定性和邻域相似性的社区发现算法 Community Detection Algorithm Based on Node Stability and Neighbor Similarity 计算机科学, 2022, 49(9): 83-91. https://doi.org/10.11896/jsjkx.220400146 |
[3] | 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩. 基于分层抽样优化的面向异构客户端的联邦学习 Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients 计算机科学, 2022, 49(9): 183-193. https://doi.org/10.11896/jsjkx.220500263 |
[4] | 冷典典, 杜鹏, 陈建廷, 向阳. 面向自动化集装箱码头的AGV行驶时间估计 Automated Container Terminal Oriented Travel Time Estimation of AGV 计算机科学, 2022, 49(9): 208-214. https://doi.org/10.11896/jsjkx.210700028 |
[5] | 宁晗阳, 马苗, 杨波, 刘士昌. 密码学智能化研究进展与分析 Research Progress and Analysis on Intelligent Cryptology 计算机科学, 2022, 49(9): 288-296. https://doi.org/10.11896/jsjkx.220300053 |
[6] | 刘冬梅, 徐洋, 吴泽彬, 刘倩, 宋斌, 韦志辉. 基于边框距离度量的增量目标检测方法 Incremental Object Detection Method Based on Border Distance Measurement 计算机科学, 2022, 49(8): 136-142. https://doi.org/10.11896/jsjkx.220100132 |
[7] | 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇. 基于大数据的进化网络影响力分析研究综述 Survey of Influence Analysis of Evolutionary Network Based on Big Data 计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240 |
[8] | 武红鑫, 韩萌, 陈志强, 张喜龙, 李慕航. 监督和半监督学习下的多标签分类综述 Survey of Multi-label Classification Based on Supervised and Semi-supervised Learning 计算机科学, 2022, 49(8): 12-25. https://doi.org/10.11896/jsjkx.210700111 |
[9] | 李瑶, 李涛, 李埼钒, 梁家瑞, Ibegbu Nnamdi JULIAN, 陈俊杰, 郭浩. 基于多尺度的稀疏脑功能超网络构建及多特征融合分类研究 Construction and Multi-feature Fusion Classification Research Based on Multi-scale Sparse Brain Functional Hyper-network 计算机科学, 2022, 49(8): 257-266. https://doi.org/10.11896/jsjkx.210600094 |
[10] | 张光华, 高天娇, 陈振国, 于乃文. 基于N-Gram静态分析技术的恶意软件分类研究 Study on Malware Classification Based on N-Gram Static Analysis Technology 计算机科学, 2022, 49(8): 336-343. https://doi.org/10.11896/jsjkx.210900203 |
[11] | 陈明鑫, 张钧波, 李天瑞. 联邦学习攻防研究综述 Survey on Attacks and Defenses in Federated Learning 计算机科学, 2022, 49(7): 310-323. https://doi.org/10.11896/jsjkx.211000079 |
[12] | 刘丽, 李仁发. 医疗CPS协作网络控制策略优化 Control Strategy Optimization of Medical CPS Cooperative Network 计算机科学, 2022, 49(6A): 39-43. https://doi.org/10.11896/jsjkx.210300230 |
[13] | 李亚茹, 张宇来, 王佳晨. 面向超参数估计的贝叶斯优化方法综述 Survey on Bayesian Optimization Methods for Hyper-parameter Tuning 计算机科学, 2022, 49(6A): 86-92. https://doi.org/10.11896/jsjkx.210300208 |
[14] | 赵璐, 袁立明, 郝琨. 多示例学习算法综述 Review of Multi-instance Learning Algorithms 计算机科学, 2022, 49(6A): 93-99. https://doi.org/10.11896/jsjkx.210500047 |
[15] | 侯夏晔, 陈海燕, 张兵, 袁立罡, 贾亦真. 一种基于支持向量机的主动度量学习算法 Active Metric Learning Based on Support Vector Machines 计算机科学, 2022, 49(6A): 113-118. https://doi.org/10.11896/jsjkx.210500034 |
|