Computer Science ›› 2016, Vol. 43 ›› Issue (5): 243-246.doi: 10.11896/j.issn.1002-137X.2016.05.045

Previous Articles     Next Articles

Improved Text Clustering Algorithm Based on Kolmogorov Complexity

WANG You-hua and CHEN Xiao-rong   

  • Online:2018-12-01 Published:2018-12-01

Abstract: Clustering algorithm based on Kolmogorov complexity has the advantages of generality,parameter indepen-dence,but always shows low accuracy when applied to the text semantic information clustering.In order to solve this problem,this paper proposed a text clustering algorithm based on feature extension-DEF-KC.For improving keyword’stheme contribution,DEF-KC applies feature extension to the keyword in the pretreated text by referencing information of specific entry in a baidu encyclopedia,and calculates the text similarity by approximate Kolmogorov complexity of the text.Finally it clusters text using spectral clustering algorithm.The experimental results show that the proposed algorithm has much better accuracy and recall rate compared to the traditional text clustering algorithm based on Kolmogorov complexity.

Key words: Kolmogorov complexity,Text clustering,Feature extension,Spectral clustering

[1] Ming Jun-ren.Research on Text Clustering Model Based on Ontology Graph[J].Information Science,2013,1(2):29-33(in Chinese) 明均仁.基于本体图的文本聚类模型研究[J].情报科学,2013,1(2):29-33
[2] Nikvand N,Wang Z.Generic image similarity based on Kolmo-gorov complexity[C]∥2010 17th IEEE International Conference on Image Processing(ICIP).IEEE,2010:309-312
[3] Zhang L,Zhuang Y,Yuan Z.A program plagiarism detection mo-del based on information distance and clustering[C]∥The 2007 International Conference on Intelligent Pervasive Computing,2007(IPC).IEEE,2007:431-436
[4] Ukil A.Application of Kolmogorov complexity in anomaly de-tection[C]∥2010 16th Asia-Pacific Conference on Communications(APCC).IEEE,2010:141-146
[5] Belabbes S,Richard G.On Using SVM and Kolmogorov Complexity for Spam Filtering[C]∥FLAIRS Conference.2008:130-135
[6] Geweniger T,Schleif F M,Hasenfuss A,et al.Comparison of cluster algorithms for the analysis of text data using kolmogorov complexity[M]∥Advances in Neuro-Information Processing.Springer Berlin Heidelberg,2009:61-69
[7] Vita,nyi,Paul M B.Information Distance in Multiples[J].IEEE Transactions on Information Theory ,2011,57(4):2451-2456
[8] Vitanyi P M B,Balbach F J,Cilibrasi R L,et al.Normalized information distance[M]∥Information Theory and Statistical Learning.Springer US,2009:45-82
[9] Tao Xiao-lei.Study of Kolmogorov Complexity Based Clustering Algorithms[D].Nanjing:Nanjing University of Aeronautics and Astronautics,2013(in Chinese) 陶小雷.基于 Kolmogorov 复杂性的聚类方法研究[D].南京:南京航空航天大学,2013
[10] Wang Hui-qing,Chen Jun-jie.Research of spectral clusteringbased on graph partition[J].Computer Engineering and Design,2011(1):289-292(in Chinese) 王会青,陈俊杰.基于图划分的谱聚类方法的研究[J].计算机工程与设计,2011(1):289-292
[11] Lv Chao-zhen,Ji Dong-hong,Wu Fei-fei.Short text classification based on expanding feature of LDA[J].Computer Engineering and Applications,2015(4):123-127(in Chinese) 吕超镇,姬东鸿,吴飞飞.基于LDA特征扩展的短文本分类[J].计算机工程与应用,2015(4):123-127

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] LEI Li-hui and WANG Jing. Parallelization of LTL Model Checking Based on Possibility Measure[J]. Computer Science, 2018, 45(4): 71 -75, 88 .
[2] XIA Qing-xun and ZHUANG Yi. Remote Attestation Mechanism Based on Locality Principle[J]. Computer Science, 2018, 45(4): 148 -151, 162 .
[3] LI Bai-shen, LI Ling-zhi, SUN Yong and ZHU Yan-qin. Intranet Defense Algorithm Based on Pseudo Boosting Decision Tree[J]. Computer Science, 2018, 45(4): 157 -162 .
[4] WANG Huan, ZHANG Yun-feng and ZHANG Yan. Rapid Decision Method for Repairing Sequence Based on CFDs[J]. Computer Science, 2018, 45(3): 311 -316 .
[5] SUN Qi, JIN Yan, HE Kun and XU Ling-xuan. Hybrid Evolutionary Algorithm for Solving Mixed Capacitated General Routing Problem[J]. Computer Science, 2018, 45(4): 76 -82 .
[6] ZHANG Jia-nan and XIAO Ming-yu. Approximation Algorithm for Weighted Mixed Domination Problem[J]. Computer Science, 2018, 45(4): 83 -88 .
[7] WU Jian-hui, HUANG Zhong-xiang, LI Wu, WU Jian-hui, PENG Xin and ZHANG Sheng. Robustness Optimization of Sequence Decision in Urban Road Construction[J]. Computer Science, 2018, 45(4): 89 -93 .
[8] LIU Qin. Study on Data Quality Based on Constraint in Computer Forensics[J]. Computer Science, 2018, 45(4): 169 -172 .
[9] ZHONG Fei and YANG Bin. License Plate Detection Based on Principal Component Analysis Network[J]. Computer Science, 2018, 45(3): 268 -273 .
[10] SHI Wen-jun, WU Ji-gang and LUO Yu-chun. Fast and Efficient Scheduling Algorithms for Mobile Cloud Offloading[J]. Computer Science, 2018, 45(4): 94 -99, 116 .