Computer Science ›› 2016, Vol. 43 ›› Issue (1): 246-250, 269. doi: 10.11896/j.issn.1002-137X.2016.01.053

Text Clustering Method Study Based on MapReduce

LI Zhao, LI Xiao, WANG Chun-mei, LI Cheng and YANG Chun   

Online: 2018-12-01 Published: 2018-12-01

Abstract: Text clustering is a key technology for text organization, information extraction, and topic retrieval. Selecting an appropriate similarity measure is an important part of clustering and strongly affects the results. Classical similarity measures, such as distance functions and the correlation coefficient, can only describe linear relationships between documents. Because the relationships among text documents are complicated, clustering results based on classical methods are usually unsatisfactory. More sophisticated clustering methods have been studied, but their computational cost rises markedly as the dataset grows, and classical clustering methods cannot cope with large-scale datasets. This paper proposes a distributed clustering method based on MapReduce for large-scale text clustering. It also proposes an improved version of the k-means algorithm that uses information loss as the similarity function. To speed up clustering, a parallel PCA method based on MapReduce is used to reduce the dimension of the document vectors. Experimental results show that the proposed method is more efficient for text clustering than classical clustering methods.

Key words: Text clustering, MapReduce, K-means, Information loss
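
The abstract does not give the formula for information loss or the layout of the MapReduce job, so the following is only a minimal sketch of how one k-means iteration with an information-theoretic distance might be organized. It assumes that each document is a probability distribution over terms and that the information loss between a document and a centroid is the Jensen-Shannon divergence with equal weights, a common choice in information bottleneck clustering. The function names (information_loss, map_assign, reduce_update, kmeans_iteration) are hypothetical, and the map and reduce phases are simulated in a single process rather than run on Hadoop.

```python
import math
from collections import defaultdict


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; zero entries of p add nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def information_loss(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence between p and q (assumed loss measure)."""
    mix = [wp * pi + wq * qi for pi, qi in zip(p, q)]
    return wp * kl(p, mix) + wq * kl(q, mix)


def map_assign(doc, centroids):
    """Map step: emit (index of the lowest-loss centroid, (document, 1))."""
    best = min(range(len(centroids)),
               key=lambda k: information_loss(doc, centroids[k]))
    return best, (doc, 1)


def reduce_update(values):
    """Reduce step: average all documents emitted under one centroid index."""
    total, count = None, 0
    for doc, n in values:
        total = list(doc) if total is None else [a + b for a, b in zip(total, doc)]
        count += n
    return [t / count for t in total]


def kmeans_iteration(docs, centroids):
    """Run one simulated map/shuffle/reduce round and return the new centroids."""
    groups = defaultdict(list)
    for doc in docs:                       # map phase
        key, value = map_assign(doc, centroids)
        groups[key].append(value)          # shuffle: group values by key
    updated = list(centroids)              # centroids with no documents stay put
    for key, values in groups.items():     # reduce phase
        updated[key] = reduce_update(values)
    return updated
```

In a real deployment the map calls would run on separate data splits and the grouping would be done by the framework's shuffle; iterating kmeans_iteration until the centroids stop changing gives the full clustering loop.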

