Computer Science (计算机科学) ›› 2016, Vol. 43 ›› Issue (12): 229-233. doi: 10.11896/j.issn.1002-137X.2016.12.042

• Data Mining •

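Term and Semantic Difference Metric Based Document Clustering Algorithm

WEI Lin-jing (魏霖静), LIAN Zhi-chao (练智超), WANG Lian-guo (王联国) and HOU Zhen-xing (侯振兴)

  1. College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730070; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094; College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730070; School of Information Management, Nanjing University, Nanjing 210093
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    the National Natural Science Foundation of China (034031122, 61063028), the Youth Fund of the Natural Science Foundation of Jiangsu Province (BK20150784), and a China Postdoctoral general grant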

Abstract: Most existing document clustering algorithms rely on generic similarity measures and ignore semantic content. To address this, a document clustering algorithm based on maximizing the sum of the discrimination information provided by documents is proposed. First, the discrimination information that each term provides for its own cluster and for the other clusters is analyzed separately, and the data set is transformed from the input space into a difference-score matrix space. Then, a greedy algorithm is designed to filter out the low-scoring terms in each row of the matrix. Finally, maximum likelihood estimation is used to smooth the document difference information. Simulation results show that the proposed method achieves better clustering quality than other flat and hierarchical clustering algorithms, and that it offers good interpretability and convergence.

Key words: Document clustering, Semantic analysis, Greedy algorithm, Convergence, Interpretability
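
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything specific here is an assumption: the score is taken to be a ratio P(term | cluster) / P(term | other clusters), add-one smoothing stands in for the paper's smoothing step, a fixed threshold stands in for the greedy filter, and the function names (`term_scores`, `filter_low_scores`, `assign`, `cluster`) are hypothetical. It is not the authors' implementation.

```python
from collections import Counter


def term_scores(docs, labels, k, vocab):
    """Build a k x |vocab| score matrix (one dict per cluster).

    Assumed score: P(t | cluster c) / P(t | other clusters),
    with add-one smoothing so that no ratio divides by zero.
    """
    in_counts = [Counter() for _ in range(k)]
    for doc, c in zip(docs, labels):
        in_counts[c].update(doc)
    total = Counter()
    for counts in in_counts:
        total.update(counts)
    n_total = sum(total.values())
    v = len(vocab)
    scores = []
    for c in range(k):
        n_in = sum(in_counts[c].values())
        n_out = n_total - n_in
        row = {}
        for t in vocab:
            p_in = (in_counts[c][t] + 1) / (n_in + v)
            p_out = (total[t] - in_counts[c][t] + 1) / (n_out + v)
            row[t] = p_in / p_out
        scores.append(row)
    return scores


def filter_low_scores(scores, threshold=1.0):
    """Zero out, row by row, terms whose score does not exceed the threshold.

    A simple stand-in for the greedy filtering step (assumption).
    """
    return [{t: (s if s > threshold else 0.0) for t, s in row.items()}
            for row in scores]


def assign(docs, scores):
    """Assign each document to the cluster with the largest summed term score."""
    labels = []
    for doc in docs:
        totals = [sum(row.get(t, 0.0) for t in doc) for row in scores]
        labels.append(max(range(len(scores)), key=lambda c: totals[c]))
    return labels


def cluster(docs, init_labels, k, max_iter=20):
    """Alternate scoring and reassignment until the labels stop changing."""
    vocab = sorted({t for doc in docs for t in doc})
    labels = list(init_labels)
    for _ in range(max_iter):
        scores = filter_low_scores(term_scores(docs, labels, k, vocab))
        new_labels = assign(docs, scores)
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

Each document is passed as a list of tokens, and `init_labels` is any starting partition (for example, a random one). Because the surviving terms of each row are kept with their scores, the highest-scoring terms of a cluster can be read off directly, which is one way the interpretability mentioned in the abstract could show up in practice.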

