Computer Science ›› 2019, Vol. 46 ›› Issue (9): 15-21. doi: 10.11896/j.issn.1002-137X.2019.09.002

• Surveys •

Survey of Semi-supervised Clustering

QIN Yue1, DING Shi-fei1,2   

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
  2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received:2018-09-13 Online:2019-09-15 Published:2019-09-02

Abstract: Semi-supervised clustering is a new learning method that combines semi-supervised learning with cluster analysis, and it is widely used in machine learning. Traditional unsupervised clustering algorithms need no prior information about the data when partitioning it. In practical applications, however, a small amount of supervised information is often available, in the form of independent class labels or pairwise constraints for a few samples. Researchers have therefore sought to exploit this limited supervision during clustering to obtain better results, which led to semi-supervised clustering. This paper introduces the theoretical basis and algorithmic ideas of semi-supervised clustering and summarizes its recent progress. First, the current state and taxonomy of semi-supervised learning are reviewed, and generative semi-supervised learning, semi-supervised SVM, graph-based semi-supervised learning and co-training are compared. Second, clustering within semi-supervised learning is described in detail: four typical semi-supervised clustering algorithms (Cop-Kmeans, LCop-Kmeans, Seeded-Kmeans and SC-Kmeans) are analyzed and summarized, and their advantages and disadvantages are evaluated. Then, the research status of semi-supervised clustering is reviewed separately for constraint-based and distance-based methods. Finally, applications of semi-supervised clustering in bioinformatics, image segmentation and other areas of computer science are discussed, along with future research directions. The paper aims to help beginners quickly grasp the progress of semi-supervised clustering and the ideas behind its typical algorithms, and to serve as a guide for practical applications.
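
To make the idea of constraint-based clustering concrete, the following is a minimal sketch of a Cop-Kmeans-style assignment loop in pure Python. It is an illustration under stated assumptions, not the formulation given in the paper: the function names (`cop_kmeans`, `violates`, `sq_dist`) and parameters are hypothetical, pairwise constraints are assumed to be given as index pairs split into "must-link" and "cannot-link" lists, and the transitive closure of the constraints is not computed.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def violates(i, c, labels, must_link, cannot_link):
    """True if assigning point i to cluster c breaks a pairwise constraint."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            # A must-link partner already sits in a different cluster.
            if labels[other] is not None and labels[other] != c:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            # A cannot-link partner already sits in this cluster.
            if labels[other] == c:
                return True
    return False

def cop_kmeans(points, k, must_link=(), cannot_link=(), iters=20, seed=0):
    """Simplified sketch: k-means whose assignment step respects constraints.

    Returns a list of cluster indices, or None when some point has no
    feasible cluster (the result depends on the order of the points).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iters):
        labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest center.
            order = sorted(range(k), key=lambda c: sq_dist(p, centers[c]))
            for c in order:
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c
                    break
            else:
                return None  # no cluster satisfies the constraints
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels

# Hypothetical toy data: two well-separated groups of two points each.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(cop_kmeans(pts, 2, must_link=[(0, 1)], cannot_link=[(0, 2)]))
```

In this toy run, points 0 and 1 end up in the same cluster and point 2 in the other one, as the constraints require. A Seeded-Kmeans-style variant would instead use the few labeled samples only to initialize the centers.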

Key words: Clustering, Label, Machine learning, Pairwise constraints, Semi-supervised clustering, Semi-supervised learning

CLC Number: 

  • TP181