Computer Science ›› 2015, Vol. 42 ›› Issue (6): 223-227. doi: 10.11896/j.issn.1002-137X.2015.06.047

Categorical Incremental Data Labeling Algorithm

LI Yan-hong, LI De-yu and WANG Su-ge   

  • Online: 2018-11-14 Published: 2018-11-14

Abstract: Data labeling has become a simple but effective way to improve the efficiency of incremental data clustering. Labeling assigns each newly arriving data point to the cluster that is closest to it. One of the main difficulties in categorical data analysis, however, is the lack of an appropriate way to define the similarity between a data point and a cluster. To overcome this difficulty, we define the representative of a cluster as a list of all attribute values, together with their frequencies, in each attribute domain of the cluster, and then define a point-cluster dissimilarity measure in terms of the change in information entropy. Based on this dissimilarity measure, we design a categorical incremental data labeling algorithm that allocates each unlabeled data point to the appropriate cluster. Comparative experiments on several public data sets and a text corpus show that the proposed algorithm achieves not only higher labeling accuracy and shorter execution time but also better scalability.

Key words: Clustering, Data labeling, Incremental data, Categorical data, Information entropy
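
The idea outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the measure below is an assumption: the dissimilarity between a point and a cluster is taken as the increase in size-weighted Shannon entropy, summed over attributes, that would result from adding the point to the cluster. The names `ClusterSummary`, `weighted_entropy`, `dissimilarity`, and `label` are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import log2


class ClusterSummary:
    """Cluster representative: value frequencies per attribute (hypothetical helper)."""

    def __init__(self, n_attributes):
        self.size = 0
        self.counts = [Counter() for _ in range(n_attributes)]

    def add(self, point):
        """Fold one categorical point (a tuple of values) into the summary."""
        self.size += 1
        for counter, value in zip(self.counts, point):
            counter[value] += 1


def entropy(counter, size):
    """Shannon entropy in bits of one attribute's value distribution."""
    if size == 0:
        return 0.0
    return -sum((c / size) * log2(c / size) for c in counter.values() if c > 0)


def weighted_entropy(summary):
    """Cluster size times the sum of per-attribute entropies."""
    return summary.size * sum(entropy(c, summary.size) for c in summary.counts)


def dissimilarity(point, summary):
    """Assumed measure: rise in size-weighted entropy if `point` joined the cluster."""
    before = weighted_entropy(summary)
    new_size = summary.size + 1
    after = 0.0
    for counter, value in zip(summary.counts, point):
        trial = counter.copy()  # leave the real summary untouched
        trial[value] += 1
        after += entropy(trial, new_size)
    return new_size * after - before


def label(point, summaries, update=True):
    """Return the index of the least dissimilar cluster; optionally absorb the point."""
    best = min(range(len(summaries)), key=lambda k: dissimilarity(point, summaries[k]))
    if update:
        summaries[best].add(point)
    return best


# Usage: two small clusters over the attributes (colour, size).
cluster_a, cluster_b = ClusterSummary(2), ClusterSummary(2)
for p in [("red", "small"), ("red", "small"), ("red", "large")]:
    cluster_a.add(p)
for p in [("blue", "large"), ("blue", "large"), ("green", "large")]:
    cluster_b.add(p)

first = label(("red", "small"), [cluster_a, cluster_b])
second = label(("blue", "large"), [cluster_a, cluster_b])
```

Because each cluster is summarized only by its frequency counts, labeling a point in this sketch costs time proportional to the number of clusters and attribute values, and the summary can be updated in place as points arrive. Whether the published algorithm updates cluster representatives after each assignment is not stated in the abstract; the `update` flag leaves that choice open.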

