计算机科学 ›› 2019, Vol. 46 ›› Issue (9): 15-21.doi: 10.11896/j.issn.1002-137X.2019.09.002

• 综述 • 上一篇    下一篇

半监督聚类综述

秦悦1, 丁世飞1,2   

  1. (中国矿业大学计算机科学与技术学院 江苏 徐州221116)1;
    (中国科学院计算技术研究所智能信息处理重点实验室 北京100190)2
  • 收稿日期:2018-09-13 出版日期:2019-09-15 发布日期:2019-09-02
  • 通讯作者: 丁世飞(1963-),男,博士,教授,CCF理事,主要研究方向为模式识别与人工智能、机器学习与数据挖掘、粗糙集与软计算、粒度计算、感知与认知计算,E-mail:dingsf@cumt.edu.cn
  • 作者简介:秦 悦(1994-),女,硕士生,主要研究方向为人工智能、机器学习、聚类,E-mail:qinyue23333@163.com;
  • 基金资助:
    国家自然科学基金(61672522,61379101),国家重点基础研究计划(2013CB329502)

Survey of Semi-supervised Clustering

QIN Yue1, DING Shi-fei1,2   

  1. (School of Computer Science and Technology,China University of Mining and Technology,Xuzhou,Jiangsu 221116,China)1;
    (Key Laboratory of Intelligent Information Processing,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China)2
  • Received:2018-09-13 Online:2019-09-15 Published:2019-09-02

摘要: 半监督聚类是结合半监督学习与聚类分析而提出的新的学习方法,其在机器学习中得到了广泛的重视和应用。传统无监督聚类算法在划分数据时并不需要任何数据属性,但在实际应用中,存在少量带有独立类标签或成对约束的监督信息的数据样本,学者们致力于将这些为数不多的监督信息运用于聚类,以得到更优的聚类结果,从而提出了半监督聚类。文中主要介绍了半监督聚类的理论基础和算法思想,并对半监督聚类的最新研究进展进行了综述。首先,对半监督学习的研究现状和分类进行了概述,并将生成式半监督学习、半监督SVM、基于图的半监督学习和协同训练这4种分类方法进行了对比;其次,针对半监督学习的聚类进行了详细的描述,并对4种典型半监督聚类算法(Cop-Kmeans算法、LCop-Kmeans算法、Seeded-Kmeans算法和SC-Kmeans算法)的算法思想进行了分析和总结,同时对这4种算法的优缺点进行了评价;然后,按照基于约束的半监督聚类和基于距离的半监督聚类两种情况,分别对半监督聚类的研究现状进行了阐述;最后,探讨了半监督聚类在生物信息学、图像分割以及计算机其他领域内的应用以及未来的研究方向。文中旨在使初学者能够快速了解半监督聚类的进展,理解典型的算法思想,并在之后的实际应用中能起到一定的指导作用。

关键词: 半监督学习, 聚类, 成对约束, 标签, 半监督聚类, 机器学习

Abstract: Semi-supervised clustering is a new learning method combining semi-supervised learning and clustering analysis,and it has been used widely in machine learning.The traditional unsupervised clustering algorithms do not need any data attributes when dividing data,but in practical applications,there are a small number of data samples for supervised information with independent class labels or paired constraints,so scholars are committed to applying these few supervised information into clustering to obtain better clustering results,thus proposing semi-supervised clustering.This paper mainly introduced the theoretical basis and algorithm ideas of semi-supervised clustering,and summarized the latest progress of semi-supervised clustering.Firstly,the current situation and classification of semi-supervised learning were reviewed,and the generative semi-supervised learning,semi-supervised SVM,semi-supervised learning based on graph and collaborative training were compared.Secondly,the clustering of semi-supervised learning was described in detail,four typical semi-supervised clustering algorithms (Cop-Kemans algorithm,LCop-Kmeans algorithm,Seeded-Kmeans algorithm and SC-Kmeans algorithm) were analyzed and summarized,and their advantages and disadvantages were eva-luated.Then,according to the two situations of semi-supervised clustering based on constraints and the semi-supervised clustering based on distance,the research status of semi-supervised clustering was expounded respectively.Finally,the applications of semi-supervised clustering in bioinformatics,image segmentation and other fields of computer and the future research directions were discussed.This paper aims to enable beginners to quickly know about the progress of semi-supervised clustering and understand the typical algorithm ideas,and it can play a guiding role in actual applications afterwards.

Key words: Semi-supervised learning, Clustering, Pairwise constraints, Label, Semi-supervised clustering, Machinelearning

中图分类号: 

  • TP181
[1]HARTIGAN J A,WONG M A.Algorithm AS 136:A k-means clustering algorithm[J].Applied Statistics,1979,28(1):100-108.
[2]MADDAH M,CRIMSON W E L,WARFIELD S K.Statistical modeling and EM clustering of white matter fiber tracts[C]//3rd IEEE International Symposium on Biomedical Imaging:Nano to Macro.New York:IEEE Press,2006:53-56.
[3]LI K L,CAO Z,CAO L P,et al.Some Developments on Semi-Supervised Clustering [J].Pattern Recognition and Artificial Intelligence,2009,22(5):735-742.(in Chinese)李昆仑,曹铮,曹丽苹,等.半监督聚类的若干新进展[J].模式识别与人工智能,2009,22(5):735-742.
[4]XIONG J B,LI Z K,LIU Y J.Research on the Present Situation of Semi-Supervised Clustering Algorithm[J].Modern Compu-ter,2009(12):61-64,77.(in Chinese)熊建斌,李振坤,刘怡俊.半监督聚类算法研究现状[J].现代计算机(专业版),2009(12):61-64,77.
[5]LIU J W,LIU Y,LUO X L.Semi-Supervised Learning Methods[J].Chinese Journal of Computers,2015,38(8):1592-1617.(in Chinese)刘建伟,刘媛,罗雄麟.半监督学习方法[J].计算机学报,2015,38(8):1592-1617.
[6]SCUDDER H I.Probability of error of some adaptive pattern-recognition machines[J].Information Theory IEEE Transactions on,1965,11(3):363-371.
[7]FRALICK S.Learning to recognize patterns without a teacher[J].IEEE Transactions on Information Theory,2003,13(1):57-64.
[8]AGRAWALA A.Learning with a probabilistic teacher[J].IEEE Transactions on Information Theory,1970,16(4):373-379.
[9]MERZ C J,CLAIR D C,BOND W E.Semi-supervised adaptive resonance theory (SMART2)[C]//International Joint Confe-rence on Neural Networks.Baltimore:IEEE Press,1992:851-856.
[10]SHAHSHAHANI B M,LANDGREBE D.The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon[J].IEEE Transactions on Geoscience & Remote Sensing,1994,32(5):1087-1095.
[11]KINGMA D P,REZENDE D J,MOHAME D S.Semi-Supervised Learning with Deep Generative Models[J].Advances in Neural Information Processing Systems,2014,4:3581-3589.
[12]KLEIN D,KAMVAR S D,MANNING C D.From Instance-le-vel Constraints to Space-Level Constraints:Making the Most of Prior Knowledge in Data Clustering[C]//Proceedings of the Nineteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:307-314.
[13]CHENG S,SHI Y,QIN Q.Particle swarm optimization based semi-supervised learning on Chinese text categorization//IEEE Congress on Evolutionary Computation.New York:IEEE Press,2012:1-8.
[14]WANG J,KUMAR S,CHANG S F.Semi-supervised hashing for scalable image retrieval[C]//IEEE Conference on Computer Vision and Pattern Recognition.San Francisco:DBLP,2010:3424-3431.
[15]CHEN S G,ZHANG D Q.Experimental Comparisons of Semi-Supervised Dimensional Reduction Methods [J].Journal of software,2011,22(1):28-43.(in Chinese)陈诗国,张道强.半监督降维方法的实验比较[J].软件学报,2011,22(1):28-43.
[16]ZHOU Z H,LI M.Semi-supervised regression with co-training[C]//International Joint Conference on Artificial Intelligence.San Francisco:Morgan Kaufmann Publishers Inc,2005:908-913.
[17]MEHRKANOON S,ALZATE C,MALL R,et al.MulticlassSemi-supervised Learning Based Upon Kernel Spectral Clustering[J].IEEE Transactions on Neural Networks & Learning Systems,2015,26(4):720.
[18]CALLUT J,FRANCOISSE K,SAERENS M,et al.Semi-supervised Classification from Discriminative Random Walks//ECML PKDD 2008.Berlin:Springer,2008:162-177.
[19]周志华.Machine learning.北京:清华大学出版社,2016.
[20]COZMAN F G,COHEN I.Unlabeled Data Can Degrade Classification Performance of Generative Classifiers[C]//Fifteenth International Florida Artificial Intelligence Society Conference.California:AAAI Press,2009:327-331.
[21]CASTELLI V,COVER T M.On the exponential value of la-beled samples.Elsevier Science Inc,1995.
[22]VAPNIK V,STERIN A.On structural risk minimization or overall risk in a problem of pattern recognition[J].Automation &Remote Control,1977,10(10):1495-1503.
[23]WANG X,YU H.How to Break MD5 and Other Hash Functions[M]//Advances in Cryptology-EUROCRYPT 2005.DBLP,2005:19-35.
[24]BLUM A,CHAWLA S.Learning from Labeled and Unlabeled Data using Graph Min-cuts[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:19-26.
[25]BELKIN M,NIYOGE P,SINDHWANI V.Manifold Regularization:A Geometric Framework for Learning from Labeled and Unlabeled Examples[J].Journal of Machine Learning Research,2006,7(1):2399-2434.
[26]BLUM A,MITCHELL T.Combining labeled and unlabeled data with co-training[C]//Proceedings of the 11th Annual Confe-rence on Computational Learning Theory (COLT98).Wisconsin,Ml,1998:92-100.
[27]COLDMAN S,ZHOU Y.Enhancing supervised learning withunlabeled data[C]//Proceedings of the 17th International Conference on Machine Learning(ICML’00).San Francisco:CA,2000:327-334.
[28]WAGSTAFF K,CARDIE C.Clustering with instance-level constraints[C]//Proceedings of 17th International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2000:1097-1103.
[29]WAGSTAFF K,CARDIE C,ROGERS S,et al.Constrained K-means Clustering with Background Knowledge[C]//Eighteenth International Conference on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2001:577-584.
[30]YANG Y,TAN W,LI T.Consensus clustering based on con-strained self-organizing map and improved Cop-Kmeans ensemble in intelligent decision support systems[J].Knowledge Based Systems,2012,32:101-115.
[31]BASU S,BANERJEE A,MOONEY R.Semi-Supervised Clustering by Seeding[C]//Proceedings of 19th InternationalConfe-rence on Machine Learning.San Francisco:Morgan Kaufmann Publishers Inc,2002:19-26.
[32]CHEN Z Y,WANG H J,HU M,et al.An active semi-super-vised clustering algorithm based on seeds set and pairwise constraints[J].Journal of Jilin University(Science Edition),2017,55(3):664-672.(in Chinese)陈志雨,王慧君,胡明,等.一种基于Seeds集和成对约束的主动半监督聚类算法[J].吉林大学学报(理学版),2017,55(3):664-672.
[33]ZHENG L,LI T.Semi-supervised Hierarchical Clustering[C]//International Conference on Data Mining.2011.
[34]ZHU Y,QIAN J H,JI Z B.An improved COP-Kmeans algo-rithm based on BFS.Beijing:China science and technology paper online .http://www.paper.edu.cn/releasepaper/content/201507-93.(in Chinese)朱煜,钱景辉,季正波.改进的基于广度优先搜索的COP-Kmeans算法.北京:中国科技论文在线 .http://www.paper.edu.cn/releasepaper/content/201507-93.
[35] HE P,XU X H,LU L,et al.Semi-Supervised Clustering viaTwo-Level Random Walk[J].Journal of Software,2014,25(5):997-1013.(in Chinese)何萍,徐晓华,陆林,等.双层随机游走半监督聚类[J].软件学报,2014,25(5):997-1013.
[36]TANG Q,LIAO Z G.A Semi-Supervised Clustering MethodBased on Affinity Propagation Algorithm[J].Electronic Information Warfare Technology,2017,32(1):8-12.(in Chinese)汤琼,廖泽广.一种基于AP算法的半监督聚类方法[J].电子信息对抗技术,2017,32(1):8-12.
[37]YANG Y,RUTAYISIRE T,LIN C,et al.An Improved Cop-Kmeans Clustering for Solving Constraint Violation Based on MapReduce Framework[J].Fundamental Information,2013,126(4):301-318.
[38]LI C M,XU S B,HAO Z F.Cross-Entropy semi-supervisedclustering based on pairwise constraints[J].Pattern Recognition and Artificial Intelligence,2017,30(7):598-608.(in Chinese)李晁铭,徐圣兵,郝志峰.基于成对约束的交叉熵半监督聚类算法[J].模式识别与人工智能,2017,30(7):598-608.
[39]CHAI B F,LV F,LI W B,et al.Semi-supervised Kmeans Clustering Algorithm based on Active Learning Priors.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html.(in Chinese)柴变芳,吕峰,李文斌,等.基于主动学习先验的半监督Kmeans聚类算法.[2018-11-25].http://kns.cnki.net/kcms/detail/51.1307.TP.20180719.1030.038.html.
[40]BASU S,BILENKO M,MOONEY R J.A probabilistic framework for semi-supervised clustering[C]//Proceedings of the Tenth ACM 0SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-04).New York:MIT Press,2008:59-68.
[41]DING S,JIA H,DU M,et al.A semi-supervised approximate spectral clustering algorithm based on HMRF model[J].Information Sciences,2018,429:215-228.
[42]ALOK A K,SAHA S,EKBAL A.Feature Selection and Semi-supervised Clustering Using Multi-objective Optimization[J].Springer Plus,2014,3(1):465.
[43]WEI S,LI Z,ZHANG C.Combined constraint-based with metric-based in semi-supervised clustering ensemble[J].International Journal of Machine Learning & Cybernetics,2018,9(7):1085-1100.
[44]SAHA S,KAUSHIK K,ALOK A K,et al.Multi-objective semi-supervised clustering of tissue samples for cancer diagnosis[J].Soft Computing,2016,20(9):3381-3392.
[45]CHEN H S.Semi-supervised clustering ensemble for bio-mole-cular pattern mining.Guangzhou:South China University of Technology,2016.(in Chinese)陈弘晟.半监督聚类集成在生物分子模式挖掘中的应用.广州:华南理工大学,2016.
[46]OROZCO-DUQUE A,BUSTAMANTE J,CASTELLANOS-DOMINGUEZ G.Semi-supervised clustering of fractionated electrograms for electroanatomical atrial mapping[J].Biomedical Engineering Online,2016,15(1):44.
[47]AN Q Q,ZHANG F,LI Z X,et al.Research on image segmentation based on machine learning[J].Automation & Instrumentation,2018(6):29-31.(in Chinese)安强强,张峰,李赵兴,等.基于机器学习的图像分割研究[J].自动化与仪器仪表,2018(6):29-31.
[48]LI Q L.Semi-supervised clustering based on constraints for images segmentation.Xi’an:XiDian University,2014.(in Chinese)李巧兰.基于约束的半监督聚类的图像分割算法研究 .西安:西安电子科技大学,2014.
[49]LI Y W.Research on robust segmentation algorithm based onsemi-supervised fuzzy clustering.Xi’an:Xi’an University of Posts & Telecommunications,2018.(in Chinese)李亚文.鲁棒半监督模糊聚类分割算法研究.西安:西安邮电大学,2018.
[50]ALOK A K,SAHA S,EKBAL A.Multi-objective semi-super-vised clustering for automatic pixel classification from remote sensing imagery[J].Soft Computing,2016,20(12):4733-4751.
[51]FIORE U,PALMIERI F,CASTIGLIONE A,et al.Network anomaly detection with the restricted Boltzmann machine[J].Neuro Computing,2013,122:13-23.
[52]LIANG C,LI C H.Novel Intrusion Detection Method Based on Semi-supervised Clustering[J].Computer Science,2016,43(5):87-90.(in Chinese)梁辰,李成海.一种新的半监督入侵检测方法[J].计算机科学,2016,43(5):87-90.
[53]PENG T L,ZHANG W J,LAN J L,et al.Micro video annotation method based on semi-supervised clustering[J].Application Research of Computers,2016,33(3):948-952.(in Chinese)彭太乐,张文俊,蓝建梁,等.基于半监督聚类的微视频标注方法[J].计算机应用研究,2016,33(3):948-952.
[54]ZHONG S.Semi-supervised model-based document clustering:A comparative study[J].Machine Learning,2006,65(1):3-29.
[55]CHENG X M,YANG Q H,ZHAI Y P,et al.Test Case Selection Technique Base on Semi-supervised Clustering Method[J].Computer Science,2018,45(1):249-254.(in Chinese)程雪梅,杨秋辉,翟宇鹏,等.基于半监督聚类方法的测试用例选择技术[J].计算机科学,2018,45(1):249-254.
[1] 梁伟, 段晓东, 徐健锋. 基于差异性度量的基础聚类三支过滤算法[J]. 计算机科学, 2021, 48(1): 136-144.
[2] 张煜, 陆亿红, 黄德才. 基于密度峰值的加权犹豫模糊聚类算法[J]. 计算机科学, 2021, 48(1): 145-151.
[3] 陈洁婷, 王维莹, 金琴. 弹幕信息协助下的视频多标签分类[J]. 计算机科学, 2021, 48(1): 167-174.
[4] 李吟, 李必信. 基于脚本预测和重组的内存泄漏测试加速技术[J]. 计算机科学, 2020, 47(9): 31-39.
[5] 丁钰, 魏浩, 潘志松, 刘鑫. 网络表示学习算法综述[J]. 计算机科学, 2020, 47(9): 52-59.
[6] 苏畅, 张定权, 谢显中, 谭娅. 面向5G通信网络的NFV内存资源管理方法[J]. 计算机科学, 2020, 47(9): 246-251.
[7] 杨帆, 王俊斌, 白亮. 基于安全性的成对约束扩充算法[J]. 计算机科学, 2020, 47(9): 324-329.
[8] 徐守坤, 倪楚涵, 吉晨晨, 李宁. 基于YOLOv3的施工场景安全帽佩戴的图像描述[J]. 计算机科学, 2020, 47(8): 233-240.
[9] 王慧, 乐孜纯, 龚轩, 武玉坤, 左浩. 基于特征分类的链路预测方法综述[J]. 计算机科学, 2020, 47(8): 302-312.
[10] 陈庆超, 王韬, 冯文博, 尹世庄, 刘丽君. 基于最长连续间隔的未知二进制协议格式推断[J]. 计算机科学, 2020, 47(8): 313-318.
[11] 曹素娥, 杨泽民. 基于聚类分析算法和优化支持向量机的无线网络流量预测[J]. 计算机科学, 2020, 47(8): 319-322.
[12] 高方远, 王秀美. 一种基于块对角表示和近邻约束的子空间聚类方法[J]. 计算机科学, 2020, 47(7): 66-70.
[13] 李向利, 贾梦雪. 基于预处理的超图非负矩阵分解算法[J]. 计算机科学, 2020, 47(7): 71-77.
[14] 袁野, 和晓歌, 朱定坤, 王富利, 谢浩然, 汪俊, 魏明强, 郭延文. 视觉图像显著性检测综述[J]. 计算机科学, 2020, 47(7): 84-91.
[15] 高玉潼, 雷为民, 原玥. 复杂环境下基于聚类分析的人脸目标识别[J]. 计算机科学, 2020, 47(7): 111-117.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75 .
[2] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[3] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[4] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[5] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99 .
[6] 周燕萍,业巧林. 基于L1-范数距离的最小二乘对支持向量机[J]. 计算机科学, 2018, 45(4): 100 -105 .
[7] 耿海军,施新刚,王之梁,尹霞,尹少平. 基于有向无环图的互联网域内节能路由算法[J]. 计算机科学, 2018, 45(4): 112 -116 .
[8] 崔琼,李建华,王宏,南明莉. 基于节点修复的网络化指挥信息系统弹性分析模型[J]. 计算机科学, 2018, 45(4): 117 -121 .
[9] 杨羽琦,章国安,金喜龙. 车载自组织网络中基于车辆密度的双簇头路由协议[J]. 计算机科学, 2018, 45(4): 126 -130 .
[10] 施超,谢在鹏,柳晗,吕鑫. 基于稳定匹配的容器部署策略的优化[J]. 计算机科学, 2018, 45(4): 131 -136 .