计算机科学 ›› 2016, Vol. 43 ›› Issue (5): 243-246.doi: 10.11896/j.issn.1002-137X.2016.05.045
王有华,陈笑蓉
WANG You-hua and CHEN Xiao-rong
摘要: 基于Kolmogorov复杂性的聚类算法虽然具有普适性、参数无关性的优点,但是应用到文本内容语义信息聚类时往往准确率较低。针对这一问题,提出了一种基于特征扩展的文本聚类改进算法——DEF-KC算法。该算法通过引用百度百科中特定词条的信息,对预处理过的文本中的关键词进行特征扩展,从而提高特征词的主题贡献度,增强文本的结构辨识度,并通过选取特定压缩算法近似计算Kolmogorov复杂性得到文本相似度,最后使用谱聚类算法进行聚类。实验结果表明,与传统的基于Kolmogorov复杂性的文本聚类算法相比,使用该算法时聚类准确率和召回率均得到了较大提升。
[1] Ming Jun-ren.Research on Text Clustering Model Based on Ontology Graph[J].Information Science,2013,1(2):29-33(in Chinese) 明均仁.基于本体图的文本聚类模型研究[J].情报科学,2013,1(2):29-33 [2] Nikvand N,Wang Z.Generic image similarity based on Kolmo-gorov complexity[C]∥2010 17th IEEE International Conference on Image Processing(ICIP).IEEE,2010:309-312 [3] Zhang L,Zhuang Y,Yuan Z.A program plagiarism detection mo-del based on information distance and clustering[C]∥The 2007 International Conference on Intelligent Pervasive Computing,2007(IPC).IEEE,2007:431-436 [4] Ukil A.Application of Kolmogorov complexity in anomaly de-tection[C]∥2010 16th Asia-Pacific Conference on Communications(APCC).IEEE,2010:141-146 [5] Belabbes S,Richard G.On Using SVM and Kolmogorov Complexity for Spam Filtering[C]∥FLAIRS Conference.2008:130-135 [6] Geweniger T,Schleif F M,Hasenfuss A,et al.Comparison of cluster algorithms for the analysis of text data using kolmogorov complexity[M]∥Advances in Neuro-Information Processing.Springer Berlin Heidelberg,2009:61-69 [7] Vita,nyi,Paul M B.Information Distance in Multiples[J].IEEE Transactions on Information Theory ,2011,57(4):2451-2456 [8] Vitanyi P M B,Balbach F J,Cilibrasi R L,et al.Normalized information distance[M]∥Information Theory and Statistical Learning.Springer US,2009:45-82 [9] Tao Xiao-lei.Study of Kolmogorov Complexity Based Clustering Algorithms[D].Nanjing:Nanjing University of Aeronautics and Astronautics,2013(in Chinese) 陶小雷.基于 Kolmogorov 复杂性的聚类方法研究[D].南京:南京航空航天大学,2013 [10] Wang Hui-qing,Chen Jun-jie.Research of spectral clusteringbased on graph partition[J].Computer Engineering and Design,2011(1):289-292(in Chinese) 王会青,陈俊杰.基于图划分的谱聚类方法的研究[J].计算机工程与设计,2011(1):289-292 [11] Lv Chao-zhen,Ji Dong-hong,Wu Fei-fei.Short text classification based on expanding feature of LDA[J].Computer Engineering and Applications,2015(4):123-127(in Chinese) 吕超镇,姬东鸿,吴飞飞.基于LDA特征扩展的短文本分类[J].计算机工程与应用,2015(4):123-127 |
No related articles found! |
|