Computer Science, 2016, Vol. 43, Issue (12): 229-233. doi: 10.11896/j.issn.1002-137X.2016.12.042

Term and Semantic Difference Metric Based Document Clustering Algorithm

WEI Lin-jing, LIAN Zhi-chao, WANG Lian-guo and HOU Zhen-xing   

Online: 2018-12-01   Published: 2018-12-01

Abstract: Existing document clustering algorithms rely on common similarity measures but ignore semantics. This paper therefore proposes a document clustering algorithm that maximizes the sum of the discrimination information provided by documents. First, the discrimination information that each term provides for its own cluster and for the other clusters is analyzed separately, and the data set is transformed from the input space into the difference-score matrix space. Then, a greedy algorithm filters low-scoring terms out of each row of the matrix. Finally, maximum likelihood estimation is used to smooth the document difference information. Simulation results show that the proposed method yields better cluster quality than flat and hierarchical clustering algorithms, and that it performs well in terms of interpretability and convergence.

Key words: Document clustering, Semantic analysis, Greedy algorithm, Convergence, Interpretability
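
The three steps named in the abstract (scoring terms per cluster, filtering low-scoring terms from each row, and reassigning documents until the labels stop changing) can be illustrated with a rough sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the score is a log ratio of a term's smoothed probability inside a cluster versus outside it, the filter simply keeps the top `keep` terms per row, additive smoothing with `alpha` stands in for the paper's smoothing step, and the function names `term_scores`, `filter_low_scores`, and `cluster` are hypothetical.

```python
import math

def term_scores(docs, labels, k, alpha=1.0):
    """Assumed score: log of (smoothed P(term | cluster) / smoothed P(term | other clusters))."""
    n_terms = len(docs[0])
    inside = [[0.0] * n_terms for _ in range(k)]
    # Accumulate term counts per cluster.
    for doc, c in zip(docs, labels):
        for t, count in enumerate(doc):
            inside[c][t] += count
    total = [sum(inside[c][t] for c in range(k)) for t in range(n_terms)]
    scores = []
    for c in range(k):
        in_sum = sum(inside[c])
        out_sum = sum(total) - in_sum
        row = []
        for t in range(n_terms):
            # Additive smoothing (an assumption) keeps both probabilities positive.
            p_in = (inside[c][t] + alpha) / (in_sum + alpha * n_terms)
            p_out = (total[t] - inside[c][t] + alpha) / (out_sum + alpha * n_terms)
            row.append(math.log(p_in / p_out))
        scores.append(row)
    return scores

def filter_low_scores(scores, keep):
    """Keep the `keep` highest-scoring terms in each row and zero out the rest."""
    filtered = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda t: row[t], reverse=True)
        top = set(order[:keep])
        filtered.append([s if t in top else 0.0 for t, s in enumerate(row)])
    return filtered

def cluster(docs, labels, k, keep, max_iter=20):
    """Reassign each document to the cluster with the largest summed score until labels are stable."""
    labels = list(labels)
    for _ in range(max_iter):
        scores = filter_low_scores(term_scores(docs, labels, k), keep)
        new_labels = []
        for doc in docs:
            doc_scores = [sum(n * s for n, s in zip(doc, scores[c])) for c in range(k)]
            new_labels.append(max(range(k), key=lambda c: doc_scores[c]))
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Toy term-count matrix: rows are documents, columns are terms (made-up data).
docs = [[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 2, 3], [0, 1, 0, 4]]
result = cluster(docs, labels=[0, 0, 0, 1], k=2, keep=2)
print(result)
```

On this toy input the third document starts in the wrong cluster and is moved after the first pass. Like any iterative reassignment scheme, the sketch can settle in a local optimum, so the outcome depends on the initial labels.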

