Computer Science ›› 2016, Vol. 43 ›› Issue (Z11): 443-446. doi: 10.11896/j.issn.1002-137X.2016.11A.099


Short Text Clustering Algorithm Combined with Context Semantic Information

ZHANG Qun, WANG Hong-jun and WANG Lun-wen   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: Short texts carry little information and yield high-dimensional, sparse feature vectors, so conventional text clustering methods perform poorly on them. To address this, this paper proposes a short text clustering algorithm that incorporates contextual semantic information. First, borrowing the ideas of centrality and prestige from social network analysis, the algorithm improves the conventional feature weight calculation by taking the semantic information in the context into account. On this basis, it constructs the term-document matrix and applies singular value decomposition to that matrix, mapping the original high-dimensional term vector space to a lower-dimensional latent semantic space. Finally, it clusters the short texts in the latent semantic space with an improved K-means clustering algorithm. Experimental results show that, compared with traditional text clustering methods, the proposed scheme effectively mitigates the information insufficiency, high dimensionality, and feature sparsity of short text and substantially improves the evaluation metrics of short text clustering.

Key words: Short text clustering, Context semantic information, Singular value decomposition, K-means clustering algorithm
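
The three stages named in the abstract (feature weighting, singular value decomposition of the term-document matrix, K-means in the latent space) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the centrality- and prestige-based weighting or the improvements to K-means, so plain smoothed TF-IDF and standard K-means with a deterministic farthest-first initialization are swapped in as stand-ins. The function names `tfidf_matrix`, `lsa_embed`, and `kmeans`, the whitespace tokenization, and the toy corpus are all assumptions made for illustration.

```python
import numpy as np


def tfidf_matrix(docs):
    """Build a terms x documents matrix with smoothed TF-IDF weights.

    Stand-in for the paper's context-aware weighting, which the abstract
    does not specify.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            tf[index[w], j] += 1.0
    df = np.count_nonzero(tf, axis=1)              # document frequency per term
    idf = np.log((1.0 + len(docs)) / (1.0 + df)) + 1.0
    return tf * idf[:, None], vocab


def lsa_embed(X, rank):
    """Map each document to a rank-dimensional latent space via truncated SVD."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = (s[:rank, None] * Vt[:rank]).T             # documents x rank
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)            # unit-length document vectors


def kmeans(Z, k, iters=50):
    """Standard K-means with farthest-first initialization (an assumption,
    not the paper's improved variant)."""
    centers = [Z[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[int(np.argmax(d))])
    centers = np.array(centers)
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            if np.any(labels == c):                # keep old center if cluster is empty
                new_centers[c] = Z[labels == c].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


if __name__ == "__main__":
    # Hypothetical toy corpus with two topics.
    docs = [
        "cat dog pet", "dog pet food", "cat pet food",
        "stock market price", "market price trade", "stock trade price",
    ]
    X, vocab = tfidf_matrix(docs)
    Z = lsa_embed(X, rank=2)
    print(kmeans(Z, k=2))
```

Clustering in the low-rank space rather than on the raw term vectors is the point of the second stage: documents that share no exact terms can still land close together once correlated terms are merged into the same latent dimension.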

