计算机科学 ›› 2016, Vol. 43 ›› Issue (12): 183-188.doi: 10.11896/j.issn.1002-137X.2016.12.033

• 数据挖掘 • 上一篇    下一篇

一种成对约束限制的半监督文本聚类算法

王纵虎,刘速   

  1. 中国人民大学统计学院 北京100872;中国石油规划总院计算机信息中心 北京102206,中国石油规划总院计算机信息中心 北京102206
  • 出版日期:2018-12-01 发布日期:2018-12-01

Pairwise Constrained Semi-supervised Text Clustering Algorithm

WANG Zong-hu and LIU Su   

  • Online:2018-12-01 Published:2018-12-01

摘要: 半监督聚类能利用少量标记数据来提高聚类算法性能,但大部分文本聚类算法无法直接应用成对约束等先验信息。针对文本数据高维稀疏的特点,提出了一种半监督文本聚类算法。将成对约束信息扩展后嵌入文档相似度矩阵,在此基础上根据已划分与未划分文档之间的统计信息逐步找出剩余未划分文本集合中密集的且与已划分聚类中心集合相似度较小的K个初始聚类中心集合,然后将剩余的相对较难区分的文档结合成对约束限制信息划分到K个初始聚类中心集合,最后通过融合成对约束违反惩罚的收敛准则函数对聚类结果进行进一步优化。算法在聚类过程中自动确定初始聚类中心集合,避免了K均值算法对初始聚类中心选择的敏感性。在几个中英文数据集上的实验结果表明,所提算法能有效地利用少量的成对约束先验信息提高聚类效果。

关键词: 聚类,半监督,向量空间模型,成对约束,文本

Abstract: Semi-supervised clustering can use a small amount of tag data to improve the clustering performance,but most of the text clustering algorithms can not directly apply priori information such as pairwise constraints.As the characteristics of text data were high-dimensional and sparse,we proposed a semi-supervised document clustering algorithm.First,pairwise constraints were expanded and embedded in the document similarity matrix,then K density regions which have a small similarity with the already partitioned text collection were gradually searched in the remaining unpartitioned text collection as initial centroid.The remaining unpartitioned texts which are relatively difficult to distinguish were assigned to the K initial centroid according to the constraints.Finally,the clustering result was optimized by the convergence criterion function through integration of punish violations of pairwise constraints.In the clustering process,it can automatically determines the initial centroids to avoid the sensitivity to the initial centroids of K-means algorithm.Experimental results show that the proposed algorithm can effectively use a small amount of pairwise constraints to improve the clustering performance in Chinese and English text datasets.

Key words: Clustering,Semi-supervised,VSM,Pairwise constraints,Text

[1] Wagstaff K,Cardie C.Clustering with Instance-Level Const-raints[C]∥Proc of the17th International Conference on Machine Learning,San Francisco,USA.2000:1103-1110
[2] Li Xue-mei,Wang Li-hong,Song Yi-bin.A Hybrid Constrained Semi-Supervised Clustering Algorithm[J].Pattern Recognition and Artificial Intelligence,2011,4(4):452-456(in Chinese) 李雪梅,王立宏,宋宜斌.一种混合约束的半监督聚类算法[J].模式识别与人工智能,2011,4(4):452-456
[3] Li Kun-lun,Cao Zheng,Cao Li-ping,et al.Some Developments on Semi-Supervised Clustering[J].Pattern Recognition and Artificial Intelligence,2009,22(5):735-742(in Chinese) 李昆仑,曹铮,曹丽苹,等.半监督聚类的若干新进展[J].模式识别与人工智能,2009,22(5):735-742
[4] Wang Ling,Bo Lie-feng,Jiao Li-cheng.Density-Sensitive Semi-Supervised Spectral Clustering[J].Journal of Software,2007,18(10):2412-2422(in Chinese) 王玲,薄列峰,焦李成.密度敏感的半监督谱聚类[J].软件学报,2007,18(10):2412-2422
[5] Yin Xue-song,Hu En-liang,Chen Song-can.Discriminative Semi-Supervised Clustering Analysis with Pairwise Constraints[J].Journal of Software,2008,19(11):2791-2802(in Chinese) 尹学松,胡恩良,陈松灿.基于成对约束的判别型半监督聚类分析[J].软件学报,2008,19(11):2791-2802
[6] Xiao Yu,Yu Jian.Semi-Supervised Clustering Based on Affinity Propagation Algorithm[J].Journal of Software,2008,19 (11):2803-2813(in Chinese) 肖宇,于剑.基于近邻传播算法的半监督聚类[J].软件学报,2008,19(11):2803-2813
[7] Huang R Z,Lamb W.An active framework for semi-supervised document clustering with languages modeling[J].Data &Knowledge Engineering,2008,8(1):1136-1155
[8] Wang J LWu S Y,Vu H Q,et al.Text document clustering with metric learning[C]∥Proc.of the 33rd Annual Intl.ACM SIGIR Conf.on Research and Development in Information Retrieval.2010:783-784
[9] Shi Zhong.Semi-supervised model-based document clustering:A comparative study[J].Machine Learing,2006,5(3):3-29
[10] Higgs R E,Bemis K G,Watson I A,et al.Experi-mental designs for selecting molecules from large chemical databases[J].Journal of Chemical Information and Computer Sciences,1997,37(5):861-870
[11] Qin Yu,Jing Ji-wu,Xiang Ji,et al.An improved K-means algorithm based on optimizing initial points[J].Journal of the Gra-duate School of the Chinese Academy of Science,2007,24(6):771-777(in Chinese) 秦钰,荆继武,向继,等.基于优化初始类中心点的K-means改进算法[J].中国科学院研究生院学报,2007,24(6):771-777
[12] Pantel P,Lin D.Document clustering with committees[C]∥Proceedings of the 25 th International Conference on Research and Development in Information Retrieval.New York,NY,USA:ACM Press,2002:199-206
[13] Wang Zhong,Liu Gui-quan,Chen En-hong.A K-means Algo-rithm Based on Optimized Initial Center Points[J].Pattern Re-cognition and Artificial Intelligence,2009,22(2):299-304(in Chinese) 汪中,刘贵全,陈恩红.一种优化初始中心点的K-means算法[J].模式识别与人工智能,2009,22(2):299-304
[14] Klein D,Kamvar S D,Manning C.From instance-level con-straints to space-level constraints:Making the most of prior knowledge in data clustering[C]∥Proceedings of the Nineteenth International Conference on Machine Learning.2002:307-314
[15] Bilenko M,Basu S,Mooney R J.Integrating Constraints andMetric Learning in Semi-Supervised Clustering[C]∥Proc of the 21st International Conference on Machine Learning.2004:81-88
[16] Karypis G.CLUTO:a clustering toolkit.http://glaros.dtc.umn.edu/gkhome/cluto/cluto
[17] Wang Na,Li Xia.Active Semi-Supervised Spectral ClusteringBased on Pairwise Constraints[J].Acta Electronica Sinica,2010,38(1):172-176(in Chinese) 王娜,李霞.基于监督信息特性的主动半监督谱聚类算法[J].电子学报,2010,8(1):172-176
[18] Su Ying-bin,Du Xue-hui,et al.Sensitive Information Inference Method Based on Semi-supervised Document Clustering[J].Computer Science,2015,2(10):132-137(in Chinese) 苏嬴彬,杜学绘,等.基于半监督聚类的文档敏感信息推导方法[J].计算机科学,2015,2(10):132-137
[19] Zhao Wei-zhong,Ma Hui-fang,Li Zhi-qing,et al.Efficiently active learning for Semi-Supervised document clustering[J].Journal of Software,2012,3(6):1486-1499(in Chinese) 赵卫中,马慧芳,李志清,等.一种结合主动学习的半监督文档聚类算法[J].软件学报,2012,3(6):1486-1499
[20] Xiong S,Azimi J,Fem X Z.Active learing of constraints for se-mi-supervised clustering[J].IEEE Transactions on know-ledge and Data Engineering,2014,6(1):43-45

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75, 88 .
[2] 夏庆勋,庄毅. 一种基于局部性原理的远程验证机制[J]. 计算机科学, 2018, 45(4): 148 -151, 162 .
[3] 厉柏伸,李领治,孙涌,朱艳琴. 基于伪梯度提升决策树的内网防御算法[J]. 计算机科学, 2018, 45(4): 157 -162 .
[4] 王欢,张云峰,张艳. 一种基于CFDs规则的修复序列快速判定方法[J]. 计算机科学, 2018, 45(3): 311 -316 .
[5] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[6] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[7] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[8] 刘琴. 计算机取证过程中基于约束的数据质量问题研究[J]. 计算机科学, 2018, 45(4): 169 -172 .
[9] 钟菲,杨斌. 基于主成分分析网络的车牌检测方法[J]. 计算机科学, 2018, 45(3): 268 -273 .
[10] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99, 116 .