计算机科学 ›› 2019, Vol. 46 ›› Issue (9): 15-21.doi: 10.11896/j.issn.1002-137X.2019.09.002

• 综述 • 上一篇    下一篇

半监督聚类综述

秦悦1, 丁世飞1,2   

  1. (中国矿业大学计算机科学与技术学院 江苏 徐州221116)1;
    (中国科学院计算技术研究所智能信息处理重点实验室 北京100190)2
  • 收稿日期:2018-09-13 出版日期:2019-09-15 发布日期:2019-09-02
  • 通讯作者: 丁世飞(1963-),男,博士,教授,CCF理事,主要研究方向为模式识别与人工智能、机器学习与数据挖掘、粗糙集与软计算、粒度计算、感知与认知计算,E-mail:dingsf@cumt.edu.cn
  • 作者简介:秦 悦(1994-),女,硕士生,主要研究方向为人工智能、机器学习、聚类,E-mail:qinyue23333@163.com;
  • 基金资助:
    国家自然科学基金(61672522,61379101),国家重点基础研究计划(2013CB329502)

Survey of Semi-supervised Clustering

QIN Yue1, DING Shi-fei1,2   

  1. (School of Computer Science and Technology,China University of Mining and Technology,Xuzhou,Jiangsu 221116,China)1;
    (Key Laboratory of Intelligent Information Processing,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China)2
  • Received:2018-09-13 Online:2019-09-15 Published:2019-09-02

摘要: 半监督聚类是结合半监督学习与聚类分析而提出的新的学习方法,其在机器学习中得到了广泛的重视和应用。传统无监督聚类算法在划分数据时并不需要任何数据属性,但在实际应用中,存在少量带有独立类标签或成对约束的监督信息的数据样本,学者们致力于将这些为数不多的监督信息运用于聚类,以得到更优的聚类结果,从而提出了半监督聚类。文中主要介绍了半监督聚类的理论基础和算法思想,并对半监督聚类的最新研究进展进行了综述。首先,对半监督学习的研究现状和分类进行了概述,并将生成式半监督学习、半监督SVM、基于图的半监督学习和协同训练这4种分类方法进行了对比;其次,针对半监督学习的聚类进行了详细的描述,并对4种典型半监督聚类算法(Cop-Kmeans算法、LCop-Kmeans算法、Seeded-Kmeans算法和SC-Kmeans算法)的算法思想进行了分析和总结,同时对这4种算法的优缺点进行了评价;然后,按照基于约束的半监督聚类和基于距离的半监督聚类两种情况,分别对半监督聚类的研究现状进行了阐述;最后,探讨了半监督聚类在生物信息学、图像分割以及计算机其他领域内的应用以及未来的研究方向。文中旨在使初学者能够快速了解半监督聚类的进展,理解典型的算法思想,并在之后的实际应用中能起到一定的指导作用。

关键词: 半监督学习, 聚类, 成对约束, 标签, 半监督聚类, 机器学习

Abstract: Semi-supervised clustering is a new learning method combining semi-supervised learning and clustering analysis,and it has been used widely in machine learning.The traditional unsupervised clustering algorithms do not need any data attributes when dividing data,but in practical applications,there are a small number of data samples for supervised information with independent class labels or paired constraints,so scholars are committed to applying these few supervised information into clustering to obtain better clustering results,thus proposing semi-supervised clustering.This paper mainly introduced the theoretical basis and algorithm ideas of semi-supervised clustering,and summarized the latest progress of semi-supervised clustering.Firstly,the current situation and classification of semi-supervised learning were reviewed,and the generative semi-supervised learning,semi-supervised SVM,semi-supervised learning based on graph and collaborative training were compared.Secondly,the clustering of semi-supervised learning was described in detail,four typical semi-supervised clustering algorithms (Cop-Kemans algorithm,LCop-Kmeans algorithm,Seeded-Kmeans algorithm and SC-Kmeans algorithm) were analyzed and summarized,and their advantages and disadvantages were eva-luated.Then,according to the two situations of semi-supervised clustering based on constraints and the semi-supervised clustering based on distance,the research status of semi-supervised clustering was expounded respectively.Finally,the applications of semi-supervised clustering in bioinformatics,image segmentation and other fields of computer and the future research directions were discussed.This paper aims to enable beginners to quickly know about the progress of semi-supervised clustering and understand the typical algorithm ideas,and it can play a guiding role in actual applications afterwards.

Key words: Semi-supervised learning, Clustering, Pairwise constraints, Label, Semi-supervised clustering, Machinelearning

中图分类号: 

  • TP181
[1]HARTIGAN J A,WONG M A.Algorithm AS 136:A k-means clustering algorithm[J].Applied Statistics,1979,28(1):100-108.
[2]MADDAH M,CRIMSON W E L,WARFIELD S K.Statistical modeling and EM clustering of white matter fiber tracts[C]//3rd IEEE International Symposium on Biomedical Imaging:Nano to Macro.New York:IEEE Press,2006:53-56.
[3]LI K L,CAO Z,CAO L P,et al.Some Developments on Semi-Supervised Clustering [J].Pattern Recognition and Artificial Intelligence,2009,22(5):735-742.(in Chinese)李昆仑,曹铮,曹丽苹,等.半监督聚类的若干新进展[J].模式识别与人工智能,2009,22(5):735-742.
[4]XIONG J B,LI Z K,LIU Y J.Research on the Present Situation of Semi-Supervised Clustering Algorithm[J].Modern Compu-ter,2009(12):61-64,77.(in Chinese)熊建斌,李振坤,刘怡俊.半监督聚类算法研究现状[J].现代计算机(专业版),2009(12):61-64,77.
[5]LIU J W,LIU Y,LUO X L.Semi-Supervised Learning Methods[J].Chinese Journal of Computers,2015,38(8):1592-1617.(in Chinese)刘建伟,刘媛,罗雄麟.半监督学习方法[J].计算机学报,2015,38(8):1592-1617.
[6]SCUDDER H I.Probability of error of some adaptive pattern-recognition machines[J].Information Theory IEEE Transactions on,1965,11(3):363-371.
[7]FRALICK S.Learning to recognize patterns without a teacher[J].IEEE Transactions on Information Theory,2003,13(1):57-64.
[8]AGRAWALA A.Learning with a probabilistic teacher[J].IEEE Transactions on Information Theory,1970,16(4):373-379.
[9]MERZ C J,CLAIR D C,BOND W E.Semi-supervised adaptive resonance theory (SMART2)[C]//International Joint Confe-rence on Neural Networks.Baltimore:IEEE Press,1992:851-856.
[10]SHAHSHAHANI B M,LANDGREBE D.The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon[J].IEEE Transactions on Geoscience & Remote Sensing,1994,32(5):1087-1095.
[11]KINGMA D P,REZENDE D J,MOHAME D S.Semi-Supervised Learning with Deep Generative Models[J].Advances in Neural Information Processing Systems,2014,4:3581-3589.
[12]KLEIN D,KAMVAR S D,MANNING C D.From Instance-le-vel Constraints to Space-Level Constraints:Making the Most of Prior Knowledge in Data Clustering[C]//Proceedings of the Nineteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:307-314.
[13]CHENG S,SHI Y,QIN Q.Particle swarm optimization based semi-supervised learning on Chinese text categorization//IEEE Congress on Evolutionary Computation.New York:IEEE Press,2012:1-8.
[14]WANG J,KUMAR S,CHANG S F.Semi-supervised hashing for scalable image retrieval[C]//IEEE Conference on Computer Vision and Pattern Recognition.San Francisco:DBLP,2010:3424-3431.
[15]CHEN S G,ZHANG D Q.Experimental Comparisons of Semi-Supervised Dimensional Reduction Methods [J].Journal of software,2011,22(1):28-43.(in Chinese)陈诗国,张道强.半监督降维方法的实验比较[J].软件学报,2011,22(1):28-43.
[16]ZHOU Z H,LI M.Semi-supervised regression with co-training[C]//International Joint Conference on Artificial Intelligence.San Francisco:Morgan Kaufmann Publishers Inc,2005:908-913.
[17]MEHRKANOON S,ALZATE C,MALL R,et al.MulticlassSemi-supervised Learning Based Upon Kernel Spectral Clustering[J].IEEE Transactions on Neural Networks & Learning Systems,2015,26(4):720.
[18]CALLUT J,FRANCOISSE K,SAERENS M,et al.Semi-supervised Classification from Discriminative Random Walks//ECML PKDD 2008.Berlin:Springer,2008:162-177.
[19]周志华.Machine learning.北京:清华大学出版社,2016.
[20]COZMAN F G,COHEN I.Unlabeled Data Can Degrade Classification Performance of Generative Classifiers[C]//Fifteenth International Florida Artificial Intelligence Society Conference.California:AAAI Press,2009:327-331.
[21]CASTELLI V,COVER T M.On the exponential value of la-beled samples.Elsevier Science Inc,1995.
[22]VAPNIK V,STERIN A.On structural risk minimization or overall risk in a problem of pattern recognition[J].Automation &Remote Control,1977,10(10):1495-1503.
[23]WANG X,YU H.How to Break MD5 and Other Hash Functions[M]//Advances in Cryptology-EUROCRYPT 2005.DBLP,2005:19-35.
[24]BLUM A,CHAWLA S.Learning from Labeled and Unlabeled Data using Graph Min-cuts[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:19-26.
[25]BELKIN M,NIYOGE P,SINDHWANI V.Manifold Regularization:A Geometric Framework for Learning from Labeled and Unlabeled Examples[J].Journal of Machine Learning Research,2006,7(1):2399-2434.
[26]BLUM A,MITCHELL T.Combining labeled and unlabeled data with co-training[C]//Proceedings of the 11th Annual Confe-rence on Computational Learning Theory (COLT98).Wisconsin,Ml,1998:92-100.
[27]COLDMAN S,ZHOU Y.Enhancing supervised learning withunlabeled data[C]//Proceedings of the 17th International Conference on Machine Learning(ICML’00).San Francisco:CA,2000:327-334.
[28]WAGSTAFF K,CARDIE C.Clustering with instance-level constraints[C]//Proceedings of 17th International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2000:1097-1103.
[29]WAGSTAFF K,CARDIE C,ROGERS S,et al.Constrained K-means Clustering with Background Knowledge[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:577-584.
[30]YANG Y,TAN W,LI T.Consensus clustering based on con-strained self-organizing map and improved Cop-Kmeans ensemble in intelligent decision support systems[J].Knowledge Based Systems,2012,32:101-115.
[31]BASU S,BANERJEE A,MOONEY R.Semi-Supervised Clustering by Seeding[C]//Proceedings of 19th InternationalConfe-rence on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:19-26.
[32]CHEN Z Y,WANG H J,HU M,et al.An active semi-super-vised clustering algorithm based on seeds set and pairwise constraints[J].Journal of Jilin University(Science Edition),2017,55(3):664-672.(in Chinese)陈志雨,王慧君,胡明,等.一种基于Seeds集和成对约束的主动半监督聚类算法[J].吉林大学学报(理学版),2017,55(3):664-672.
[33]ZHENG L,LI T.Semi-supervised Hierarchical Clustering[C]//International Conference on Data Mining.2011.
[34]ZHU Y,QIAN J H,JI Z B.An improved COP-Kmeans algo-rithm based on BFS.Beijing:China science and technology paper online .http://www.paper.edu.cn/releasepaper/content/201507-93.(in Chinese)朱煜,钱景辉,季正波.改进的基于广度优先搜索的COP-Kmeans算法.北京:中国科技论文在线 .http://www.paper.edu.cn/releasepaper/content/201507-93.
[35] HE P,XU X H,LU L,et al.Semi-Supervised Clustering viaTwo-Level Random Walk[J].Journal of Software,2014,25(5):997-1013.(in Chinese)何萍,徐晓华,陆林,等.双层随机游走半监督聚类[J].软件学报,2014,25(5):997-1013.
[36]TANG Q,LIAO Z G.A Semi-Supervised Clustering MethodBased on Affinity Propagation Algorithm[J].Electronic Information Warfare Technology,2017,32(1):8-12.(in Chinese)汤琼,廖泽广.一种基于AP算法的半监督聚类方法[J].电子信息对抗技术,2017,32(1):8-12.
[37]YANG Y,RUTAYISIRE T,LIN C,et al.An Improved Cop-Kmeans Clustering for Solving Constraint Violation Based on MapReduce Framework[J].Fundamental Information,2013,126(4):301-318.
[38]LI C M,XU S B,HAO Z F.Cross-Entropy semi-supervisedclustering based on pairwise constraints[J].Pattern Recognition and Artificial Intelligence,2017,30(7):598-608.(in Chinese)李晁铭,徐圣兵,郝志峰.基于成对约束的交叉熵半监督聚类算法[J].模式识别与人工智能,2017,30(7):598-608.
[39]CHAI B F,LV F,LI W B,et al.Semi-supervised Kmeans Clustering Algorithm based on Active Learning Priors.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html.(in Chinese)柴变芳,吕峰,李文斌,等.基于主动学习先验的半监督Kmeans聚类算法.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html.
[40]BASU S,BILENKO M,MOONEY R J.A probabilistic framework for semi-supervised clustering[C]//Proceedings of the Tenth ACM 0SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-04).New York:MIT Press,2008:59-68.
[41]DING S,JIA H,DU M,et al.A semi-supervised approximate spectral clustering algorithm based on HMRF model[J].Information Sciences,2018,429:215-228.
[42]ALOK A K,SAHA S,EKBAL A.Feature Selection and Semi-supervised Clustering Using Multi-objective Optimization[J].Springer Plus,2014,3(1):465.
[43]WEI S,LI Z,ZHANG C.Combined constraint-based with metric-based in semi-supervised clustering ensemble[J].International Journal of Machine Learning & Cybernetics,2018,9(7):1085-1100.
[44]SAHA S,KAUSHIK K,ALOK A K,et al.Multi-objective semi-supervised clustering of tissue samples for cancer diagnosis[J].Soft Computing,2016,20(9):3381-3392.
[45]CHEN H S.Semi-supervised clustering ensemble for bio-mole-cular pattern mining.Guangzhou:South China University of Technology,2016.(in Chinese)陈弘晟.半监督聚类集成在生物分子模式挖掘中的应用.广州:华南理工大学,2016.
[46]OROZCO-DUQUE A,BUSTAMANTE J,CASTELLANOS-DOMINGUEZ G.Semi-supervised clustering of fractionated electrograms for electroanatomical atrial mapping[J].Biomedical Engineering Online,2016,15(1):44.
[47]AN Q Q,ZHANG F,LI Z X,et al.Research on image segmentation based on machine learning[J].Automation & Instrumentation,2018(6):29-31.(in Chinese)安强强,张峰,李赵兴,等.基于机器学习的图像分割研究[J].自动化与仪器仪表,2018(6):29-31.
[48]LI Q L.Semi-supervised clustering based on constraints for images segmentation.Xi’an:XiDian University,2014.(in Chinese)李巧兰.基于约束的半监督聚类的图像分割算法研究 .西安:西安电子科技大学,2014.
[49]LI Y W.Research on robust segmentation algorithm based onsemi-supervised fuzzy clustering.Xi’an:Xi’an University of Posts & Telecommunications,2018.(in Chinese)李亚文.鲁棒半监督模糊聚类分割算法研究.西安:西安邮电大学,2018.
[50]ALOK A K,SAHA S,EKBAL A.Multi-objective semi-super-vised clustering for automatic pixel classification from remote sensing imagery[J].Soft Computing,2016,20(12):4733-4751.
[51]FIORE U,PALMIERI F,CASTIGLIONE A,et al.Network anomaly detection with the restricted Boltzmann machine[J].Neuro Computing,2013,122:13-23.
[52]LIANG C,LI C H.Novel Intrusion Detection Method Based on Semi-supervised Clustering[J].Computer Science,2016,43(5):87-90.(in Chinese)梁辰,李成海.一种新的半监督入侵检测方法[J].计算机科学,2016,43(5):87-90.
[53]PENG T L,ZHANG W J,LAN J L,et al.Micro video annotation method based on semi-supervised clustering[J].Application Research of Computers,2016,33(3):948-952.(in Chinese)彭太乐,张文俊,蓝建梁,等.基于半监督聚类的微视频标注方法[J].计算机应用研究,2016,33(3):948-952.
[54]ZHONG S.Semi-supervised model-based document clustering:A comparative study[J].Machine Learning,2006,65(1):3-29.
[55]CHENG X M,YANG Q H,ZHAI Y P,et al.Test Case Selection Technique Base on Semi-supervised Clustering Method[J].Computer Science,2018,45(1):249-254.(in Chinese)程雪梅,杨秋辉,翟宇鹏,等.基于半监督聚类方法的测试用例选择技术[J].计算机科学,2018,45(1):249-254.
[1] 吴振宇, 李云雷, 吴凡. 基于Tucker分解的半监督支持张量机[J]. 计算机科学, 2019, 46(9): 195-200.
[2] 孙国道, 周志秀, 李思, 刘义鹏, 梁荣华. 基于地理标签的推文话题时空演变的可视分析方法[J]. 计算机科学, 2019, 46(8): 42-49.
[3] 张会兵, 钟昊, 胡晓丽. 基于主题分析的用户评论聚类方法[J]. 计算机科学, 2019, 46(8): 50-55.
[4] 杨震, 王红军. 基于轨迹划分与密度聚类的移动用户重要地点识别方法[J]. 计算机科学, 2019, 46(8): 23-27.
[5] 蔡莉, 李英姿, 江芳, 梁宇. 面向城市热点区域的不平衡数据聚类挖掘研究[J]. 计算机科学, 2019, 46(8): 16-22.
[6] 孙书亚, 方欢, 方贤文. 日志诱导下的形态学片段流程聚类方法[J]. 计算机科学, 2019, 46(8): 71-77.
[7] 雒僖, 范九伦, 于海燕, 梁丹. 基于阴影集的截集式可能性C-均值聚类截集门限的选取[J]. 计算机科学, 2019, 46(8): 249-254.
[8] 刘长赟,杨宇迪,周丽华,赵丽红. 带有时间标签的流行社交位置发现[J]. 计算机科学, 2019, 46(7): 186-194.
[9] 王丽芳, 史超宇, 蔺素珍, 秦品乐, 高媛. 基于联合图像块聚类自适应字典学习的多模态医学图像融合[J]. 计算机科学, 2019, 46(7): 238-245.
[10] 李晓光, 邵超. 基于网格数据中心的密度峰值聚类算法[J]. 计算机科学, 2019, 46(6A): 457-460.
[11] 王鹏跃, 郭茂祖, 赵玲玲, 张昱. 城市空气质量感知方法综述[J]. 计算机科学, 2019, 46(6A): 35-40.
[12] 王楠, 孙善武. 基于半监督聚类分析的无人机故障识别[J]. 计算机科学, 2019, 46(6A): 192-195.
[13] 侯媛媛, 何儒汉, 李敏, 陈佳. 结合卷积神经网络多层特征融合和K-Means聚类的服装图像检索方法[J]. 计算机科学, 2019, 46(6A): 215-221.
[14] 李建军, 侯跃, 杨玉. 基于情景感知的用户兴趣推荐模型[J]. 计算机科学, 2019, 46(6A): 502-506.
[15] 黄海燕, 刘晓明, 孙华勇, 杨志才. 聚类分析算法在不确定性决策中的应用[J]. 计算机科学, 2019, 46(6A): 593-597.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[2] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[3] 朱淑芹,王文宏,李俊青. 针对基于感知器模型的混沌图像加密算法的选择明文攻击[J]. 计算机科学, 2018, 45(4): 178 -181, 189 .
[4] 戴文静, 袁家斌. 隐含子群问题的研究现状[J]. 计算机科学, 2018, 45(6): 1 -8 .
[5] 张景,朱国宾. 基于CBOW-LDA主题模型的Stack Overflow编程网站热点主题发现研究[J]. 计算机科学, 2018, 45(4): 208 -214 .
[6] 童泽平,李涛,李立杰,任亮. 基于随机需求与产能限制的供应链协同优化研究[J]. 计算机科学, 2018, 45(4): 260 -265 .
[7] 梁俊斌,周翔,王田,李陶深. 移动低占空比无线传感网中数据收集的研究进展[J]. 计算机科学, 2018, 45(4): 19 -24, 52 .
[8] 张文博,侯晓荣. 基于高斯分布的大气光估计算法[J]. 计算机科学, 2018, 45(4): 301 -305 .
[9] 郁湧,康庆怡,陈长赓,阚世林,骆永军. 基于内聚度和耦合度的二分K均值方法[J]. 计算机科学, 2018, 45(6A): 460 -464 .
[10] 吴伟男, 刘建明. 面向低功耗无线传感器网络的动态重传算法[J]. 计算机科学, 2018, 45(6): 96 -99,123 .