计算机科学 ›› 2016, Vol. 43 ›› Issue (12): 183-188.doi: 10.11896/j.issn.1002-137X.2016.12.033
王纵虎,刘速
WANG Zong-hu and LIU Su
摘要: 半监督聚类能利用少量标记数据来提高聚类算法性能,但大部分文本聚类算法无法直接应用成对约束等先验信息。针对文本数据高维稀疏的特点,提出了一种半监督文本聚类算法。将成对约束信息扩展后嵌入文档相似度矩阵,在此基础上根据已划分与未划分文档之间的统计信息逐步找出剩余未划分文本集合中密集的且与已划分聚类中心集合相似度较小的K个初始聚类中心集合,然后将剩余的相对较难区分的文档结合成对约束限制信息划分到K个初始聚类中心集合,最后通过融合成对约束违反惩罚的收敛准则函数对聚类结果进行进一步优化。算法在聚类过程中自动确定初始聚类中心集合,避免了K均值算法对初始聚类中心选择的敏感性。在几个中英文数据集上的实验结果表明,所提算法能有效地利用少量的成对约束先验信息提高聚类效果。
[1] Wagstaff K,Cardie C.Clustering with Instance-Level Const-raints[C]∥Proc of the17th International Conference on Machine Learning,San Francisco,USA.2000:1103-1110 [2] Li Xue-mei,Wang Li-hong,Song Yi-bin.A Hybrid Constrained Semi-Supervised Clustering Algorithm[J].Pattern Recognition and Artificial Intelligence,2011,4(4):452-456(in Chinese) 李雪梅,王立宏,宋宜斌.一种混合约束的半监督聚类算法[J].模式识别与人工智能,2011,4(4):452-456 [3] Li Kun-lun,Cao Zheng,Cao Li-ping,et al.Some Developments on Semi-Supervised Clustering[J].Pattern Recognition and Artificial Intelligence,2009,22(5):735-742(in Chinese) 李昆仑,曹铮,曹丽苹,等.半监督聚类的若干新进展[J].模式识别与人工智能,2009,22(5):735-742 [4] Wang Ling,Bo Lie-feng,Jiao Li-cheng.Density-Sensitive Semi-Supervised Spectral Clustering[J].Journal of Software,2007,18(10):2412-2422(in Chinese) 王玲,薄列峰,焦李成.密度敏感的半监督谱聚类[J].软件学报,2007,18(10):2412-2422 [5] Yin Xue-song,Hu En-liang,Chen Song-can.Discriminative Semi-Supervised Clustering Analysis with Pairwise Constraints[J].Journal of Software,2008,19(11):2791-2802(in Chinese) 尹学松,胡恩良,陈松灿.基于成对约束的判别型半监督聚类分析[J].软件学报,2008,19(11):2791-2802 [6] Xiao Yu,Yu Jian.Semi-Supervised Clustering Based on Affinity Propagation Algorithm[J].Journal of Software,2008,19 (11):2803-2813(in Chinese) 肖宇,于剑.基于近邻传播算法的半监督聚类[J].软件学报,2008,19(11):2803-2813 [7] Huang R Z,Lamb W.An active framework for semi-supervised document clustering with languages modeling[J].Data &Knowledge Engineering,2008,8(1):1136-1155 [8] Wang J LWu S Y,Vu H Q,et al.Text document clustering with metric learning[C]∥Proc.of the 33rd Annual Intl.ACM SIGIR Conf.on Research and Development in Information Retrieval.2010:783-784 [9] Shi Zhong.Semi-supervised model-based document clustering:A comparative study[J].Machine Learing,2006,5(3):3-29 [10] Higgs R E,Bemis K G,Watson I A,et al.Experi-mental designs for selecting molecules from large chemical databases[J].Journal of Chemical Information and Computer Sciences,1997,37(5):861-870 [11] Qin Yu,Jing Ji-wu,Xiang Ji,et al.An improved K-means algorithm based on optimizing initial points[J].Journal of the Gra-duate School of the Chinese Academy of Science,2007,24(6):771-777(in Chinese) 秦钰,荆继武,向继,等.基于优化初始类中心点的K-means改进算法[J].中国科学院研究生院学报,2007,24(6):771-777 [12] Pantel P,Lin D.Document clustering with committees[C]∥Proceedings of the 25 th International Conference on Research and Development in Information Retrieval.New York,NY,USA:ACM Press,2002:199-206 [13] Wang Zhong,Liu Gui-quan,Chen En-hong.A K-means Algo-rithm Based on Optimized Initial Center Points[J].Pattern Re-cognition and Artificial Intelligence,2009,22(2):299-304(in Chinese) 汪中,刘贵全,陈恩红.一种优化初始中心点的K-means算法[J].模式识别与人工智能,2009,22(2):299-304 [14] Klein D,Kamvar S D,Manning C.From instance-level con-straints to space-level constraints:Making the most of prior knowledge in data clustering[C]∥Proceedings of the Nineteenth International Conference on Machine Learning.2002:307-314 [15] Bilenko M,Basu S,Mooney R J.Integrating Constraints andMetric Learning in Semi-Supervised Clustering[C]∥Proc of the 21st International Conference on Machine Learning.2004:81-88 [16] Karypis G.CLUTO:a clustering toolkit.http://glaros.dtc.umn.edu/gkhome/cluto/cluto [17] Wang Na,Li Xia.Active Semi-Supervised Spectral ClusteringBased on Pairwise Constraints[J].Acta Electronica Sinica,2010,38(1):172-176(in Chinese) 王娜,李霞.基于监督信息特性的主动半监督谱聚类算法[J].电子学报,2010,8(1):172-176 [18] Su Ying-bin,Du Xue-hui,et al.Sensitive Information Inference Method Based on Semi-supervised Document Clustering[J].Computer Science,2015,2(10):132-137(in Chinese) 苏嬴彬,杜学绘,等.基于半监督聚类的文档敏感信息推导方法[J].计算机科学,2015,2(10):132-137 [19] Zhao Wei-zhong,Ma Hui-fang,Li Zhi-qing,et al.Efficiently active learning for Semi-Supervised document clustering[J].Journal of Software,2012,3(6):1486-1499(in Chinese) 赵卫中,马慧芳,李志清,等.一种结合主动学习的半监督文档聚类算法[J].软件学报,2012,3(6):1486-1499 [20] Xiong S,Azimi J,Fem X Z.Active learing of constraints for se-mi-supervised clustering[J].IEEE Transactions on know-ledge and Data Engineering,2014,6(1):43-45 |
No related articles found! |
|