Computer Science (计算机科学), 2016, Vol. 43, Issue (Z11): 443-446. doi: 10.11896/j.issn.1002-137X.2016.11A.099

• Information Security •

Short Text Clustering Algorithm Combined with Context Semantic Information

ZHANG Qun, WANG Hong-jun, WANG Lun-wen

  1. Electronic Engineering Institute, Hefei 230037
  • Published: 2018-12-01  Online: 2018-12-01
  • Funding:
    Supported by the National Natural Science Foundation of China (61273302)

Abstract: Short texts carry little feature information and yield high-dimensional, sparse representations, so conventional text clustering algorithms perform poorly on them. To address this, the paper proposes a short text clustering algorithm that incorporates context semantic information. First, borrowing the notions of centrality and prestige from social network analysis, the algorithm uses a feature-term weighting method that takes context semantics into account, and builds the term-document matrix from these weights. Second, it applies singular value decomposition to this matrix, mapping the original high-dimensional term space to a low-dimensional latent semantic space. Finally, it clusters the short texts in that latent semantic space with an improved K-means algorithm. Experimental results show that, compared with traditional text clustering based on term frequency and inverse document frequency weights, the proposed algorithm mitigates the feature insufficiency and high-dimensional sparsity of short texts and improves the evaluation metrics of short text clustering.

Key words: Short text clustering, Context semantic information, Singular value decomposition, K-means clustering algorithm
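
The three stages named in the abstract (context-aware term weighting, SVD projection, K-means in the latent space) can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the abstract does not give the weighting formula or the K-means improvement, so degree centrality on a term co-occurrence graph and farthest-first seeding are stand-ins chosen here. The function names, the `window` parameter, and the toy documents are all hypothetical, and the input is assumed to be already tokenized.

```python
import numpy as np

def context_weighted_matrix(docs, window=2):
    """Term-document matrix: term frequency scaled by a centrality score.

    Assumption: degree centrality in a co-occurrence graph stands in for
    the paper's (unspecified) centrality/prestige-based weighting.
    """
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    cooc = np.zeros((n, n))
    for d in docs:
        for i, w in enumerate(d):
            # link each term to the next `window` terms in the same text
            for v in d[i + 1:i + 1 + window]:
                if v != w:
                    cooc[index[w], index[v]] += 1
                    cooc[index[v], index[w]] += 1
    # degree centrality: share of other terms this term co-occurs with
    centrality = (cooc > 0).sum(axis=1) / max(n - 1, 1)
    A = np.zeros((n, len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            A[index[w], j] += 1
    A *= (1.0 + centrality)[:, None]
    return A, vocab

def lsa_project(A, k):
    """Map documents to a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = Vt[:k].T * s[:k]  # one row per document
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)  # unit-length rows

def kmeans(X, k, iters=50):
    """Lloyd's K-means with farthest-first seeding (a stand-in, see above)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

def cluster_short_texts(docs, n_clusters, n_dims):
    A, _ = context_weighted_matrix(docs)
    return kmeans(lsa_project(A, n_dims), n_clusters)

# Toy, pre-tokenized short texts (hypothetical data)
docs = [
    ["stock", "market", "price"],
    ["market", "price", "trade"],
    ["stock", "trade", "market"],
    ["football", "match", "goal"],
    ["match", "goal", "team"],
    ["football", "team", "match"],
]
labels = cluster_short_texts(docs, n_clusters=2, n_dims=2)
print(labels)
```

On this toy input the two topics share no vocabulary, so the first three texts should fall in one cluster and the last three in the other. Real short texts would first need word segmentation and stop-word filtering, which the sketch leaves out.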

