计算机科学 ›› 2016, Vol. 43 ›› Issue (5): 243-246.doi: 10.11896/j.issn.1002-137X.2016.05.045

• 人工智能 • 上一篇    下一篇

基于Kolmogorov复杂性的文本聚类算法改进

王有华,陈笑蓉   

  1. 贵州大学计算机科学与技术学院 贵阳550025,贵州大学计算机科学与技术学院 贵阳550025
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受国家自然科学基金(61363028)资助

Improved Text Clustering Algorithm Based on Kolmogorov Complexity

WANG You-hua and CHEN Xiao-rong   

  • Online:2018-12-01 Published:2018-12-01

摘要: 基于Kolmogorov复杂性的聚类算法虽然具有普适性、参数无关性的优点,但是应用到文本内容语义信息聚类时往往准确率较低。针对这一问题,提出了一种基于特征扩展的文本聚类改进算法——DEF-KC算法。该算法通过引用百度百科中特定词条的信息,对预处理过的文本中的关键词进行特征扩展,从而提高特征词的主题贡献度,增强文本的结构辨识度,并通过选取特定压缩算法近似计算Kolmogorov复杂性得到文本相似度,最后使用谱聚类算法进行聚类。实验结果表明,与传统的基于Kolmogorov复杂性的文本聚类算法相比,使用该算法时聚类准确率和召回率均得到了较大提升。

关键词: Kolmogorov复杂性,文本聚类,特征扩展,谱聚类

Abstract: Clustering algorithm based on Kolmogorov complexity has the advantages of generality,parameter indepen-dence,but always shows low accuracy when applied to the text semantic information clustering.In order to solve this problem,this paper proposed a text clustering algorithm based on feature extension-DEF-KC.For improving keyword’stheme contribution,DEF-KC applies feature extension to the keyword in the pretreated text by referencing information of specific entry in a baidu encyclopedia,and calculates the text similarity by approximate Kolmogorov complexity of the text.Finally it clusters text using spectral clustering algorithm.The experimental results show that the proposed algorithm has much better accuracy and recall rate compared to the traditional text clustering algorithm based on Kolmogorov complexity.

Key words: Kolmogorov complexity,Text clustering,Feature extension,Spectral clustering

[1] Ming Jun-ren.Research on Text Clustering Model Based on Ontology Graph[J].Information Science,2013,1(2):29-33(in Chinese) 明均仁.基于本体图的文本聚类模型研究[J].情报科学,2013,1(2):29-33
[2] Nikvand N,Wang Z.Generic image similarity based on Kolmo-gorov complexity[C]∥2010 17th IEEE International Conference on Image Processing(ICIP).IEEE,2010:309-312
[3] Zhang L,Zhuang Y,Yuan Z.A program plagiarism detection mo-del based on information distance and clustering[C]∥The 2007 International Conference on Intelligent Pervasive Computing,2007(IPC).IEEE,2007:431-436
[4] Ukil A.Application of Kolmogorov complexity in anomaly de-tection[C]∥2010 16th Asia-Pacific Conference on Communications(APCC).IEEE,2010:141-146
[5] Belabbes S,Richard G.On Using SVM and Kolmogorov Complexity for Spam Filtering[C]∥FLAIRS Conference.2008:130-135
[6] Geweniger T,Schleif F M,Hasenfuss A,et al.Comparison of cluster algorithms for the analysis of text data using kolmogorov complexity[M]∥Advances in Neuro-Information Processing.Springer Berlin Heidelberg,2009:61-69
[7] Vita,nyi,Paul M B.Information Distance in Multiples[J].IEEE Transactions on Information Theory ,2011,57(4):2451-2456
[8] Vitanyi P M B,Balbach F J,Cilibrasi R L,et al.Normalized information distance[M]∥Information Theory and Statistical Learning.Springer US,2009:45-82
[9] Tao Xiao-lei.Study of Kolmogorov Complexity Based Clustering Algorithms[D].Nanjing:Nanjing University of Aeronautics and Astronautics,2013(in Chinese) 陶小雷.基于 Kolmogorov 复杂性的聚类方法研究[D].南京:南京航空航天大学,2013
[10] Wang Hui-qing,Chen Jun-jie.Research of spectral clusteringbased on graph partition[J].Computer Engineering and Design,2011(1):289-292(in Chinese) 王会青,陈俊杰.基于图划分的谱聚类方法的研究[J].计算机工程与设计,2011(1):289-292
[11] Lv Chao-zhen,Ji Dong-hong,Wu Fei-fei.Short text classification based on expanding feature of LDA[J].Computer Engineering and Applications,2015(4):123-127(in Chinese) 吕超镇,姬东鸿,吴飞飞.基于LDA特征扩展的短文本分类[J].计算机工程与应用,2015(4):123-127

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!