Computer Science ›› 2016, Vol. 43 ›› Issue (1): 246-250, 269. doi: 10.11896/j.issn.1002-137X.2016.01.053

Text Clustering Method Study Based on MapReduce

LI Zhao, LI Xiao, WANG Chun-mei, LI Cheng and YANG Chun   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: Text clustering is a key technology for text organization, information extraction, and topic retrieval. Choosing an appropriate similarity measure is an important part of clustering and strongly affects the clustering results. Classical similarity measures, such as distance functions and the correlation coefficient, can only describe linear relationships between documents, so classical clustering methods usually give unsatisfactory results when the relationships among text documents are complicated. More sophisticated clustering methods have been studied, but as text data grows in scale their computational cost increases markedly, and classical clustering methods become impractical for large-scale datasets. This paper proposes a distributed clustering method based on MapReduce for large-scale text clustering. It also proposes an improved version of the k-means algorithm that uses information loss as the similarity function. To speed up clustering, a parallel PCA method based on MapReduce is used to reduce the dimensionality of the document vectors. Experimental results show that the proposed method is more efficient for text clustering than classical clustering methods.

Key words: Text clustering, MapReduce, K-means, Information loss
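
The abstract does not define the information-loss measure, so the following is only an illustrative sketch of how k-means with information loss as the similarity function might look. It assumes that each document is a term-count vector normalized to a probability distribution, that all documents carry equal weight, and that information loss is the weighted Jensen-Shannon cost of merging a document into a cluster, as in information-bottleneck clustering. The function names, the naive initialization, and the single-process loop are hypothetical and are not taken from the paper. In a MapReduce setting the assignment step would plausibly run in the map tasks and the centroid update in the reduce tasks.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def information_loss(doc, w_doc, centroid, w_cluster):
    """Assumed definition: weighted Jensen-Shannon cost of merging a
    document (weight w_doc) into a cluster (weight w_cluster)."""
    total = w_doc + w_cluster
    a, b = w_doc / total, w_cluster / total
    mix = [a * d + b * c for d, c in zip(doc, centroid)]
    return total * (a * kl(doc, mix) + b * kl(centroid, mix))


def normalize(counts):
    """Turn a term-count vector into a probability distribution."""
    s = sum(counts)
    return [c / s for c in counts]


def kmeans_info_loss(docs, k, iterations=10):
    """Cluster term-count vectors; returns one cluster label per document."""
    dists = [normalize(d) for d in docs]
    w = 1.0 / len(docs)                 # assumption: uniform document weights
    centroids = [dists[i] for i in range(k)]  # naive init: first k documents
    weights = [w] * k
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment step: pick the cluster with the smallest information loss.
        labels = [
            min(range(k),
                key=lambda j: information_loss(p, w, centroids[j], weights[j]))
            for p in dists
        ]
        # Update step: a centroid is the mean distribution of its members.
        for j in range(k):
            members = [p for p, lab in zip(dists, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                weights[j] = w * len(members)
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus over a 4-term vocabulary: two documents per topic.
docs = [[5, 1, 0, 0], [0, 0, 3, 3], [4, 2, 0, 0], [0, 1, 2, 4]]
labels = kmeans_info_loss(docs, k=2)
```

On this toy corpus the first and third documents share most of their vocabulary, as do the second and fourth, so each pair should land in the same cluster. The parallel PCA step mentioned in the abstract is not shown here.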

