Computer Science, 2016, Vol. 43, Issue (12): 183-188. doi: 10.11896/j.issn.1002-137X.2016.12.033

Pairwise Constrained Semi-supervised Text Clustering Algorithm

WANG Zong-hu and LIU Su   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: Semi-supervised clustering can use a small amount of labeled data to improve clustering performance, but most text clustering algorithms cannot directly use prior information such as pairwise constraints. Because text data are high-dimensional and sparse, we propose a semi-supervised document clustering algorithm. First, the pairwise constraints are expanded and embedded in the document similarity matrix. Then K dense regions that have low similarity to the already partitioned texts are searched for step by step in the remaining unpartitioned texts and used as initial centroids. The remaining unpartitioned texts, which are relatively difficult to distinguish, are assigned to the K initial centroids according to the constraints. Finally, the clustering result is optimized by iterating to convergence on a criterion function that incorporates a penalty for violated pairwise constraints. During clustering, the algorithm determines the initial centroids automatically, which avoids the sensitivity of K-means to its initial centroids. Experimental results show that the proposed algorithm can effectively use a small number of pairwise constraints to improve clustering performance on Chinese and English text datasets.
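
The first step of the abstract (expanding pairwise constraints and embedding them in the document similarity matrix) can be illustrated with a minimal sketch. The abstract does not give the exact expansion or embedding rule, so everything below is an assumption: must-link pairs are expanded by transitive closure, cannot-link pairs are propagated across must-link groups, and the cosine similarity of constrained pairs is overwritten with 1.0 or 0.0. The function names expand_constraints and embed_constraints are hypothetical.

```python
import numpy as np

def expand_constraints(n, must_link, cannot_link):
    """Assumed expansion rule: transitive closure of must-link pairs,
    then cannot-link pairs propagated between must-link groups."""
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    # Collect the documents of each must-link group.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    ml = {(i, j) for g in groups.values() for i in g for j in g if i < j}
    cl = set()
    for a, b in cannot_link:
        for i in groups[find(a)]:
            for j in groups[find(b)]:
                cl.add((min(i, j), max(i, j)))
    return ml, cl

def embed_constraints(X, must_link, cannot_link):
    """Assumed embedding rule: cosine similarity over VSM rows, with
    must-link pairs forced to 1.0 and cannot-link pairs forced to 0.0."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty documents
    Xn = X / norms
    S = Xn @ Xn.T
    ml, cl = expand_constraints(X.shape[0], must_link, cannot_link)
    for i, j in ml:
        S[i, j] = S[j, i] = 1.0
    for i, j in cl:
        S[i, j] = S[j, i] = 0.0
    return S

# Toy term-frequency matrix: 4 documents, 3 terms (made-up values).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
S = embed_constraints(X, must_link=[(0, 1), (1, 2)], cannot_link=[(2, 3)])
```

In this toy call, the must-link pairs (0, 1) and (1, 2) imply (0, 2), and the cannot-link pair (2, 3) is propagated to (0, 3) and (1, 3); the later centroid search and penalized criterion function described in the abstract are not covered by this sketch.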

Key words: Clustering, Semi-supervised, VSM, Pairwise constraints, Text

