计算机科学 ›› 2019, Vol. 46 ›› Issue (9): 15-21.doi: 10.11896/j.issn.1002-137X.2019.09.002

• 综述 • 上一篇    下一篇

半监督聚类综述

秦悦1, 丁世飞1,2   

  1. (中国矿业大学计算机科学与技术学院 江苏 徐州221116)1;
    (中国科学院计算技术研究所智能信息处理重点实验室 北京100190)2
  • 收稿日期:2018-09-13 出版日期:2019-09-15 发布日期:2019-09-02
  • 通讯作者: 丁世飞(1963-),男,博士,教授,CCF理事,主要研究方向为模式识别与人工智能、机器学习与数据挖掘、粗糙集与软计算、粒度计算、感知与认知计算,E-mail:dingsf@cumt.edu.cn
  • 作者简介:秦 悦(1994-),女,硕士生,主要研究方向为人工智能、机器学习、聚类,E-mail:qinyue23333@163.com;
  • 基金资助:
    国家自然科学基金(61672522,61379101),国家重点基础研究计划(2013CB329502)

Survey of Semi-supervised Clustering

QIN Yue1, DING Shi-fei1,2   

  1. (School of Computer Science and Technology,China University of Mining and Technology,Xuzhou,Jiangsu 221116,China)1;
    (Key Laboratory of Intelligent Information Processing,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China)2
  • Received:2018-09-13 Online:2019-09-15 Published:2019-09-02

摘要: 半监督聚类是结合半监督学习与聚类分析而提出的新的学习方法,其在机器学习中得到了广泛的重视和应用。传统无监督聚类算法在划分数据时并不需要任何数据属性,但在实际应用中,存在少量带有独立类标签或成对约束的监督信息的数据样本,学者们致力于将这些为数不多的监督信息运用于聚类,以得到更优的聚类结果,从而提出了半监督聚类。文中主要介绍了半监督聚类的理论基础和算法思想,并对半监督聚类的最新研究进展进行了综述。首先,对半监督学习的研究现状和分类进行了概述,并将生成式半监督学习、半监督SVM、基于图的半监督学习和协同训练这4种分类方法进行了对比;其次,针对半监督学习的聚类进行了详细的描述,并对4种典型半监督聚类算法(Cop-Kmeans算法、LCop-Kmeans算法、Seeded-Kmeans算法和SC-Kmeans算法)的算法思想进行了分析和总结,同时对这4种算法的优缺点进行了评价;然后,按照基于约束的半监督聚类和基于距离的半监督聚类两种情况,分别对半监督聚类的研究现状进行了阐述;最后,探讨了半监督聚类在生物信息学、图像分割以及计算机其他领域内的应用以及未来的研究方向。文中旨在使初学者能够快速了解半监督聚类的进展,理解典型的算法思想,并在之后的实际应用中能起到一定的指导作用。

关键词: 半监督聚类, 半监督学习, 标签, 成对约束, 机器学习, 聚类

Abstract: Semi-supervised clustering is a new learning method combining semi-supervised learning and clustering analysis,and it has been used widely in machine learning.The traditional unsupervised clustering algorithms do not need any data attributes when dividing data,but in practical applications,there are a small number of data samples for supervised information with independent class labels or paired constraints,so scholars are committed to applying these few supervised information into clustering to obtain better clustering results,thus proposing semi-supervised clustering.This paper mainly introduced the theoretical basis and algorithm ideas of semi-supervised clustering,and summarized the latest progress of semi-supervised clustering.Firstly,the current situation and classification of semi-supervised learning were reviewed,and the generative semi-supervised learning,semi-supervised SVM,semi-supervised learning based on graph and collaborative training were compared.Secondly,the clustering of semi-supervised learning was described in detail,four typical semi-supervised clustering algorithms (Cop-Kemans algorithm,LCop-Kmeans algorithm,Seeded-Kmeans algorithm and SC-Kmeans algorithm) were analyzed and summarized,and their advantages and disadvantages were eva-luated.Then,according to the two situations of semi-supervised clustering based on constraints and the semi-supervised clustering based on distance,the research status of semi-supervised clustering was expounded respectively.Finally,the applications of semi-supervised clustering in bioinformatics,image segmentation and other fields of computer and the future research directions were discussed.This paper aims to enable beginners to quickly know about the progress of semi-supervised clustering and understand the typical algorithm ideas,and it can play a guiding role in actual applications afterwards.

Key words: Clustering, Label, Machinelearning, Pairwise constraints, Semi-supervised clustering, Semi-supervised learning

中图分类号: 

  • TP181
[1]HARTIGAN J A,WONG M A.Algorithm AS 136:A k-means clustering algorithm[J].Applied Statistics,1979,28(1):100-108.
[2]MADDAH M,CRIMSON W E L,WARFIELD S K.Statistical modeling and EM clustering of white matter fiber tracts[C]//3rd IEEE International Symposium on Biomedical Imaging:Nano to Macro.New York:IEEE Press,2006:53-56.
[3]LI K L,CAO Z,CAO L P,et al.Some Developments on Semi-Supervised Clustering [J].Pattern Recognition and Artificial Intelligence,2009,22(5):735-742.(in Chinese)李昆仑,曹铮,曹丽苹,等.半监督聚类的若干新进展[J].模式识别与人工智能,2009,22(5):735-742.
[4]XIONG J B,LI Z K,LIU Y J.Research on the Present Situation of Semi-Supervised Clustering Algorithm[J].Modern Compu-ter,2009(12):61-64,77.(in Chinese)熊建斌,李振坤,刘怡俊.半监督聚类算法研究现状[J].现代计算机(专业版),2009(12):61-64,77.
[5]LIU J W,LIU Y,LUO X L.Semi-Supervised Learning Methods[J].Chinese Journal of Computers,2015,38(8):1592-1617.(in Chinese)刘建伟,刘媛,罗雄麟.半监督学习方法[J].计算机学报,2015,38(8):1592-1617.
[6]SCUDDER H I.Probability of error of some adaptive pattern-recognition machines[J].Information Theory IEEE Transactions on,1965,11(3):363-371.
[7]FRALICK S.Learning to recognize patterns without a teacher[J].IEEE Transactions on Information Theory,2003,13(1):57-64.
[8]AGRAWALA A.Learning with a probabilistic teacher[J].IEEE Transactions on Information Theory,1970,16(4):373-379.
[9]MERZ C J,CLAIR D C,BOND W E.Semi-supervised adaptive resonance theory (SMART2)[C]//International Joint Confe-rence on Neural Networks.Baltimore:IEEE Press,1992:851-856.
[10]SHAHSHAHANI B M,LANDGREBE D.The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon[J].IEEE Transactions on Geoscience & Remote Sensing,1994,32(5):1087-1095.
[11]KINGMA D P,REZENDE D J,MOHAME D S.Semi-Supervised Learning with Deep Generative Models[J].Advances in Neural Information Processing Systems,2014,4:3581-3589.
[12]KLEIN D,KAMVAR S D,MANNING C D.From Instance-le-vel Constraints to Space-Level Constraints:Making the Most of Prior Knowledge in Data Clustering[C]//Proceedings of the Nineteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:307-314.
[13]CHENG S,SHI Y,QIN Q.Particle swarm optimization based semi-supervised learning on Chinese text categorization//IEEE Congress on Evolutionary Computation.New York:IEEE Press,2012:1-8.
[14]WANG J,KUMAR S,CHANG S F.Semi-supervised hashing for scalable image retrieval[C]//IEEE Conference on Computer Vision and Pattern Recognition.San Francisco:DBLP,2010:3424-3431.
[15]CHEN S G,ZHANG D Q.Experimental Comparisons of Semi-Supervised Dimensional Reduction Methods [J].Journal of software,2011,22(1):28-43.(in Chinese)陈诗国,张道强.半监督降维方法的实验比较[J].软件学报,2011,22(1):28-43.
[16]ZHOU Z H,LI M.Semi-supervised regression with co-training[C]//International Joint Conference on Artificial Intelligence.San Francisco:Morgan Kaufmann Publishers Inc,2005:908-913.
[17]MEHRKANOON S,ALZATE C,MALL R,et al.MulticlassSemi-supervised Learning Based Upon Kernel Spectral Clustering[J].IEEE Transactions on Neural Networks & Learning Systems,2015,26(4):720.
[18]CALLUT J,FRANCOISSE K,SAERENS M,et al.Semi-supervised Classification from Discriminative Random Walks//ECML PKDD 2008.Berlin:Springer,2008:162-177.
[19]周志华.Machine learning.北京:清华大学出版社,2016.
[20]COZMAN F G,COHEN I.Unlabeled Data Can Degrade Classification Performance of Generative Classifiers[C]//Fifteenth International Florida Artificial Intelligence Society Conference.California:AAAI Press,2009:327-331.
[21]CASTELLI V,COVER T M.On the exponential value of la-beled samples.Elsevier Science Inc,1995.
[22]VAPNIK V,STERIN A.On structural risk minimization or overall risk in a problem of pattern recognition[J].Automation &Remote Control,1977,10(10):1495-1503.
[23]WANG X,YU H.How to Break MD5 and Other Hash Functions[M]//Advances in Cryptology-EUROCRYPT 2005.DBLP,2005:19-35.
[24]BLUM A,CHAWLA S.Learning from Labeled and Unlabeled Data using Graph Min-cuts[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:19-26.
[25]BELKIN M,NIYOGE P,SINDHWANI V.Manifold Regularization:A Geometric Framework for Learning from Labeled and Unlabeled Examples[J].Journal of Machine Learning Research,2006,7(1):2399-2434.
[26]BLUM A,MITCHELL T.Combining labeled and unlabeled data with co-training[C]//Proceedings of the 11th Annual Confe-rence on Computational Learning Theory (COLT98).Wisconsin,Ml,1998:92-100.
[27]COLDMAN S,ZHOU Y.Enhancing supervised learning withunlabeled data[C]//Proceedings of the 17th International Conference on Machine Learning(ICML’00).San Francisco:CA,2000:327-334.
[28]WAGSTAFF K,CARDIE C.Clustering with instance-level constraints[C]//Proceedings of 17th International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2000:1097-1103.
[29]WAGSTAFF K,CARDIE C,ROGERS S,et al.Constrained K-means Clustering with Background Knowledge[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:577-584.
[30]YANG Y,TAN W,LI T.Consensus clustering based on con-strained self-organizing map and improved Cop-Kmeans ensemble in intelligent decision support systems[J].Knowledge Based Systems,2012,32:101-115.
[31]BASU S,BANERJEE A,MOONEY R.Semi-Supervised Clustering by Seeding[C]//Proceedings of 19th InternationalConfe-rence on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:19-26.
[32]CHEN Z Y,WANG H J,HU M,et al.An active semi-super-vised clustering algorithm based on seeds set and pairwise constraints[J].Journal of Jilin University(Science Edition),2017,55(3):664-672.(in Chinese)陈志雨,王慧君,胡明,等.一种基于Seeds集和成对约束的主动半监督聚类算法[J].吉林大学学报(理学版),2017,55(3):664-672.
[33]ZHENG L,LI T.Semi-supervised Hierarchical Clustering[C]//International Conference on Data Mining.2011.
[34]ZHU Y,QIAN J H,JI Z B.An improved COP-Kmeans algo-rithm based on BFS.Beijing:China science and technology paper online .http://www.paper.edu.cn/releasepaper/content/201507-93.(in Chinese)朱煜,钱景辉,季正波.改进的基于广度优先搜索的COP-Kmeans算法.北京:中国科技论文在线 .http://www.paper.edu.cn/releasepaper/content/201507-93.
[35] HE P,XU X H,LU L,et al.Semi-Supervised Clustering viaTwo-Level Random Walk[J].Journal of Software,2014,25(5):997-1013.(in Chinese)何萍,徐晓华,陆林,等.双层随机游走半监督聚类[J].软件学报,2014,25(5):997-1013.
[36]TANG Q,LIAO Z G.A Semi-Supervised Clustering MethodBased on Affinity Propagation Algorithm[J].Electronic Information Warfare Technology,2017,32(1):8-12.(in Chinese)汤琼,廖泽广.一种基于AP算法的半监督聚类方法[J].电子信息对抗技术,2017,32(1):8-12.
[37]YANG Y,RUTAYISIRE T,LIN C,et al.An Improved Cop-Kmeans Clustering for Solving Constraint Violation Based on MapReduce Framework[J].Fundamental Information,2013,126(4):301-318.
[38]LI C M,XU S B,HAO Z F.Cross-Entropy semi-supervisedclustering based on pairwise constraints[J].Pattern Recognition and Artificial Intelligence,2017,30(7):598-608.(in Chinese)李晁铭,徐圣兵,郝志峰.基于成对约束的交叉熵半监督聚类算法[J].模式识别与人工智能,2017,30(7):598-608.
[39]CHAI B F,LV F,LI W B,et al.Semi-supervised Kmeans Clustering Algorithm based on Active Learning Priors.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html.(in Chinese)柴变芳,吕峰,李文斌,等.基于主动学习先验的半监督Kmeans聚类算法.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html.
[40]BASU S,BILENKO M,MOONEY R J.A probabilistic framework for semi-supervised clustering[C]//Proceedings of the Tenth ACM 0SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-04).New York:MIT Press,2008:59-68.
[41]DING S,JIA H,DU M,et al.A semi-supervised approximate spectral clustering algorithm based on HMRF model[J].Information Sciences,2018,429:215-228.
[42]ALOK A K,SAHA S,EKBAL A.Feature Selection and Semi-supervised Clustering Using Multi-objective Optimization[J].Springer Plus,2014,3(1):465.
[43]WEI S,LI Z,ZHANG C.Combined constraint-based with metric-based in semi-supervised clustering ensemble[J].International Journal of Machine Learning & Cybernetics,2018,9(7):1085-1100.
[44]SAHA S,KAUSHIK K,ALOK A K,et al.Multi-objective semi-supervised clustering of tissue samples for cancer diagnosis[J].Soft Computing,2016,20(9):3381-3392.
[45]CHEN H S.Semi-supervised clustering ensemble for bio-mole-cular pattern mining.Guangzhou:South China University of Technology,2016.(in Chinese)陈弘晟.半监督聚类集成在生物分子模式挖掘中的应用.广州:华南理工大学,2016.
[46]OROZCO-DUQUE A,BUSTAMANTE J,CASTELLANOS-DOMINGUEZ G.Semi-supervised clustering of fractionated electrograms for electroanatomical atrial mapping[J].Biomedical Engineering Online,2016,15(1):44.
[47]AN Q Q,ZHANG F,LI Z X,et al.Research on image segmentation based on machine learning[J].Automation & Instrumentation,2018(6):29-31.(in Chinese)安强强,张峰,李赵兴,等.基于机器学习的图像分割研究[J].自动化与仪器仪表,2018(6):29-31.
[48]LI Q L.Semi-supervised clustering based on constraints for images segmentation.Xi’an:XiDian University,2014.(in Chinese)李巧兰.基于约束的半监督聚类的图像分割算法研究 .西安:西安电子科技大学,2014.
[49]LI Y W.Research on robust segmentation algorithm based onsemi-supervised fuzzy clustering.Xi’an:Xi’an University of Posts & Telecommunications,2018.(in Chinese)李亚文.鲁棒半监督模糊聚类分割算法研究.西安:西安邮电大学,2018.
[50]ALOK A K,SAHA S,EKBAL A.Multi-objective semi-super-vised clustering for automatic pixel classification from remote sensing imagery[J].Soft Computing,2016,20(12):4733-4751.
[51]FIORE U,PALMIERI F,CASTIGLIONE A,et al.Network anomaly detection with the restricted Boltzmann machine[J].Neuro Computing,2013,122:13-23.
[52]LIANG C,LI C H.Novel Intrusion Detection Method Based on Semi-supervised Clustering[J].Computer Science,2016,43(5):87-90.(in Chinese)梁辰,李成海.一种新的半监督入侵检测方法[J].计算机科学,2016,43(5):87-90.
[53]PENG T L,ZHANG W J,LAN J L,et al.Micro video annotation method based on semi-supervised clustering[J].Application Research of Computers,2016,33(3):948-952.(in Chinese)彭太乐,张文俊,蓝建梁,等.基于半监督聚类的微视频标注方法[J].计算机应用研究,2016,33(3):948-952.
[54]ZHONG S.Semi-supervised model-based document clustering:A comparative study[J].Machine Learning,2006,65(1):3-29.
[55]CHENG X M,YANG Q H,ZHAI Y P,et al.Test Case Selection Technique Base on Semi-supervised Clustering Method[J].Computer Science,2018,45(1):249-254.(in Chinese)程雪梅,杨秋辉,翟宇鹏,等.基于半监督聚类方法的测试用例选择技术[J].计算机科学,2018,45(1):249-254.
[1] 柴慧敏, 张勇, 方敏.
基于特征相似度聚类的空中目标分群方法
Aerial Target Grouping Method Based on Feature Similarity Clustering
计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[2] 郑文萍, 刘美麟, 杨贵.
一种基于节点稳定性和邻域相似性的社区发现算法
Community Detection Algorithm Based on Node Stability and Neighbor Similarity
计算机科学, 2022, 49(9): 83-91. https://doi.org/10.11896/jsjkx.220400146
[3] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于分层抽样优化的面向异构客户端的联邦学习
Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients
计算机科学, 2022, 49(9): 183-193. https://doi.org/10.11896/jsjkx.220500263
[4] 冷典典, 杜鹏, 陈建廷, 向阳.
面向自动化集装箱码头的AGV行驶时间估计
Automated Container Terminal Oriented Travel Time Estimation of AGV
计算机科学, 2022, 49(9): 208-214. https://doi.org/10.11896/jsjkx.210700028
[5] 宁晗阳, 马苗, 杨波, 刘士昌.
密码学智能化研究进展与分析
Research Progress and Analysis on Intelligent Cryptology
计算机科学, 2022, 49(9): 288-296. https://doi.org/10.11896/jsjkx.220300053
[6] 刘冬梅, 徐洋, 吴泽彬, 刘倩, 宋斌, 韦志辉.
基于边框距离度量的增量目标检测方法
Incremental Object Detection Method Based on Border Distance Measurement
计算机科学, 2022, 49(8): 136-142. https://doi.org/10.11896/jsjkx.220100132
[7] 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇.
基于大数据的进化网络影响力分析研究综述
Survey of Influence Analysis of Evolutionary Network Based on Big Data
计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[8] 武红鑫, 韩萌, 陈志强, 张喜龙, 李慕航.
监督和半监督学习下的多标签分类综述
Survey of Multi-label Classification Based on Supervised and Semi-supervised Learning
计算机科学, 2022, 49(8): 12-25. https://doi.org/10.11896/jsjkx.210700111
[9] 李瑶, 李涛, 李埼钒, 梁家瑞, Ibegbu Nnamdi JULIAN, 陈俊杰, 郭浩.
基于多尺度的稀疏脑功能超网络构建及多特征融合分类研究
Construction and Multi-feature Fusion Classification Research Based on Multi-scale Sparse Brain Functional Hyper-network
计算机科学, 2022, 49(8): 257-266. https://doi.org/10.11896/jsjkx.210600094
[10] 张光华, 高天娇, 陈振国, 于乃文.
基于N-Gram静态分析技术的恶意软件分类研究
Study on Malware Classification Based on N-Gram Static Analysis Technology
计算机科学, 2022, 49(8): 336-343. https://doi.org/10.11896/jsjkx.210900203
[11] 陈明鑫, 张钧波, 李天瑞.
联邦学习攻防研究综述
Survey on Attacks and Defenses in Federated Learning
计算机科学, 2022, 49(7): 310-323. https://doi.org/10.11896/jsjkx.211000079
[12] 刘丽, 李仁发.
医疗CPS协作网络控制策略优化
Control Strategy Optimization of Medical CPS Cooperative Network
计算机科学, 2022, 49(6A): 39-43. https://doi.org/10.11896/jsjkx.210300230
[13] 李亚茹, 张宇来, 王佳晨.
面向超参数估计的贝叶斯优化方法综述
Survey on Bayesian Optimization Methods for Hyper-parameter Tuning
计算机科学, 2022, 49(6A): 86-92. https://doi.org/10.11896/jsjkx.210300208
[14] 赵璐, 袁立明, 郝琨.
多示例学习算法综述
Review of Multi-instance Learning Algorithms
计算机科学, 2022, 49(6A): 93-99. https://doi.org/10.11896/jsjkx.210500047
[15] 侯夏晔, 陈海燕, 张兵, 袁立罡, 贾亦真.
一种基于支持向量机的主动度量学习算法
Active Metric Learning Based on Support Vector Machines
计算机科学, 2022, 49(6A): 113-118. https://doi.org/10.11896/jsjkx.210500034
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!