Computer Science (计算机科学) ›› 2016, Vol. 43 ›› Issue (12): 229-233. doi: 10.11896/j.issn.1002-137X.2016.12.042
WEI Lin-jing, LIAN Zhi-chao, WANG Lian-guo and HOU Zhen-xing
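Abstract: Most existing text clustering algorithms rely on generic similarity measures and ignore semantic content. To address this, a text clustering algorithm based on maximizing the discrimination information of text is proposed. First, the discrimination information that each term provides for its own cluster versus the other clusters is analyzed, and the dataset is transformed from the input space into a discrimination score matrix space. Then, a greedy algorithm is designed to filter out the low-scoring terms in each row of the matrix. Finally, maximum likelihood estimation is used to smooth the text discrimination information. Simulation results show that the proposed method achieves better document clustering quality than other hierarchical and flat (single-level) clustering algorithms, and that it offers good interpretability and convergence.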
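The three steps in the abstract can be illustrated with a rough sketch. It is not the authors' implementation: the abstract does not give the scoring formula, the greedy filter, or the smoothing estimator, so this example assumes a log-ratio score of in-cluster to out-of-cluster term probability, additive smoothing (parameter `alpha`), and a plain keep-the-top-`top_n`-terms filter per row. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def term_scores(docs, labels, k, alpha=1.0):
    """Build one row of term scores per cluster (a k x |vocab| score matrix).

    Assumed score: log of P(term | cluster) over P(term | other clusters),
    with additive smoothing so unseen terms do not give zero probabilities.
    """
    vocab = sorted({t for d in docs for t in d})
    in_counts = [Counter() for _ in range(k)]
    for d, c in zip(docs, labels):
        in_counts[c].update(d)
    total = Counter()
    for cnt in in_counts:
        total.update(cnt)
    n_total = sum(total.values())
    scores = []
    for c in range(k):
        n_in = sum(in_counts[c].values())
        n_out = n_total - n_in
        row = {}
        for t in vocab:
            p_in = (in_counts[c][t] + alpha) / (n_in + alpha * len(vocab))
            p_out = (total[t] - in_counts[c][t] + alpha) / (n_out + alpha * len(vocab))
            row[t] = math.log(p_in / p_out)
        scores.append(row)
    return scores


def keep_top_terms(scores, top_n):
    """Stand-in for the greedy filter: keep the top_n terms of each row."""
    return [
        dict(sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
        for row in scores
    ]


def assign(docs, scores):
    """Assign each document to the cluster whose kept terms score highest."""
    labels = []
    for d in docs:
        sums = [sum(row.get(t, 0.0) for t in d) for row in scores]
        labels.append(max(range(len(scores)), key=lambda c: sums[c]))
    return labels


def cluster(docs, init_labels, k, top_n=3, max_iter=20):
    """Alternate scoring and reassignment until the labels stop changing."""
    labels = list(init_labels)
    for _ in range(max_iter):
        scores = keep_top_terms(term_scores(docs, labels, k), top_n)
        new_labels = assign(docs, scores)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    toy_docs = [
        ["cat", "dog", "pet"],
        ["dog", "pet", "vet"],
        ["stock", "bond", "market"],
        ["market", "bond", "trade"],
    ]
    # The last document starts in the wrong cluster on purpose.
    print(cluster(toy_docs, [0, 0, 1, 0], k=2))
```

Because each cluster is summarized by a short list of high-scoring terms, the kept rows double as a readable description of the cluster, which is one way to read the interpretability claim in the abstract.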