Computer Science ›› 2016, Vol. 43 ›› Issue (12): 183-188.doi: 10.11896/j.issn.1002-137X.2016.12.033

Previous Articles     Next Articles

Pairwise Constrained Semi-supervised Text Clustering Algorithm

WANG Zong-hu and LIU Su   

  • Online:2018-12-01 Published:2018-12-01

Abstract: Semi-supervised clustering can use a small amount of tag data to improve the clustering performance,but most of the text clustering algorithms can not directly apply priori information such as pairwise constraints.As the characteristics of text data were high-dimensional and sparse,we proposed a semi-supervised document clustering algorithm.First,pairwise constraints were expanded and embedded in the document similarity matrix,then K density regions which have a small similarity with the already partitioned text collection were gradually searched in the remaining unpartitioned text collection as initial centroid.The remaining unpartitioned texts which are relatively difficult to distinguish were assigned to the K initial centroid according to the constraints.Finally,the clustering result was optimized by the convergence criterion function through integration of punish violations of pairwise constraints.In the clustering process,it can automatically determines the initial centroids to avoid the sensitivity to the initial centroids of K-means algorithm.Experimental results show that the proposed algorithm can effectively use a small amount of pairwise constraints to improve the clustering performance in Chinese and English text datasets.

Key words: Clustering,Semi-supervised,VSM,Pairwise constraints,Text

[1] Wagstaff K,Cardie C.Clustering with Instance-Level Const-raints[C]∥Proc of the17th International Conference on Machine Learning,San Francisco,USA.2000:1103-1110
[2] Li Xue-mei,Wang Li-hong,Song Yi-bin.A Hybrid Constrained Semi-Supervised Clustering Algorithm[J].Pattern Recognition and Artificial Intelligence,2011,4(4):452-456(in Chinese) 李雪梅,王立宏,宋宜斌.一种混合约束的半监督聚类算法[J].模式识别与人工智能,2011,4(4):452-456
[3] Li Kun-lun,Cao Zheng,Cao Li-ping,et al.Some Developments on Semi-Supervised Clustering[J].Pattern Recognition and Artificial Intelligence,2009,22(5):735-742(in Chinese) 李昆仑,曹铮,曹丽苹,等.半监督聚类的若干新进展[J].模式识别与人工智能,2009,22(5):735-742
[4] Wang Ling,Bo Lie-feng,Jiao Li-cheng.Density-Sensitive Semi-Supervised Spectral Clustering[J].Journal of Software,2007,18(10):2412-2422(in Chinese) 王玲,薄列峰,焦李成.密度敏感的半监督谱聚类[J].软件学报,2007,18(10):2412-2422
[5] Yin Xue-song,Hu En-liang,Chen Song-can.Discriminative Semi-Supervised Clustering Analysis with Pairwise Constraints[J].Journal of Software,2008,19(11):2791-2802(in Chinese) 尹学松,胡恩良,陈松灿.基于成对约束的判别型半监督聚类分析[J].软件学报,2008,19(11):2791-2802
[6] Xiao Yu,Yu Jian.Semi-Supervised Clustering Based on Affinity Propagation Algorithm[J].Journal of Software,2008,19 (11):2803-2813(in Chinese) 肖宇,于剑.基于近邻传播算法的半监督聚类[J].软件学报,2008,19(11):2803-2813
[7] Huang R Z,Lamb W.An active framework for semi-supervised document clustering with languages modeling[J].Data &Knowledge Engineering,2008,8(1):1136-1155
[8] Wang J LWu S Y,Vu H Q,et al.Text document clustering with metric learning[C]∥Proc.of the 33rd Annual Intl.ACM SIGIR Conf.on Research and Development in Information Retrieval.2010:783-784
[9] Shi Zhong.Semi-supervised model-based document clustering:A comparative study[J].Machine Learing,2006,5(3):3-29
[10] Higgs R E,Bemis K G,Watson I A,et al.Experi-mental designs for selecting molecules from large chemical databases[J].Journal of Chemical Information and Computer Sciences,1997,37(5):861-870
[11] Qin Yu,Jing Ji-wu,Xiang Ji,et al.An improved K-means algorithm based on optimizing initial points[J].Journal of the Gra-duate School of the Chinese Academy of Science,2007,24(6):771-777(in Chinese) 秦钰,荆继武,向继,等.基于优化初始类中心点的K-means改进算法[J].中国科学院研究生院学报,2007,24(6):771-777
[12] Pantel P,Lin D.Document clustering with committees[C]∥Proceedings of the 25 th International Conference on Research and Development in Information Retrieval.New York,NY,USA:ACM Press,2002:199-206
[13] Wang Zhong,Liu Gui-quan,Chen En-hong.A K-means Algo-rithm Based on Optimized Initial Center Points[J].Pattern Re-cognition and Artificial Intelligence,2009,22(2):299-304(in Chinese) 汪中,刘贵全,陈恩红.一种优化初始中心点的K-means算法[J].模式识别与人工智能,2009,22(2):299-304
[14] Klein D,Kamvar S D,Manning C.From instance-level con-straints to space-level constraints:Making the most of prior knowledge in data clustering[C]∥Proceedings of the Nineteenth International Conference on Machine Learning.2002:307-314
[15] Bilenko M,Basu S,Mooney R J.Integrating Constraints andMetric Learning in Semi-Supervised Clustering[C]∥Proc of the 21st International Conference on Machine Learning.2004:81-88
[16] Karypis G.CLUTO:a clustering toolkit.
[17] Wang Na,Li Xia.Active Semi-Supervised Spectral ClusteringBased on Pairwise Constraints[J].Acta Electronica Sinica,2010,38(1):172-176(in Chinese) 王娜,李霞.基于监督信息特性的主动半监督谱聚类算法[J].电子学报,2010,8(1):172-176
[18] Su Ying-bin,Du Xue-hui,et al.Sensitive Information Inference Method Based on Semi-supervised Document Clustering[J].Computer Science,2015,2(10):132-137(in Chinese) 苏嬴彬,杜学绘,等.基于半监督聚类的文档敏感信息推导方法[J].计算机科学,2015,2(10):132-137
[19] Zhao Wei-zhong,Ma Hui-fang,Li Zhi-qing,et al.Efficiently active learning for Semi-Supervised document clustering[J].Journal of Software,2012,3(6):1486-1499(in Chinese) 赵卫中,马慧芳,李志清,等.一种结合主动学习的半监督文档聚类算法[J].软件学报,2012,3(6):1486-1499
[20] Xiong S,Azimi J,Fem X Z.Active learing of constraints for se-mi-supervised clustering[J].IEEE Transactions on know-ledge and Data Engineering,2014,6(1):43-45

No related articles found!
Full text



No Suggested Reading articles found!