Computer Science ›› 2015, Vol. 42 ›› Issue (6): 223-227. doi: 10.11896/j.issn.1002-137X.2015.06.047


Categorical Incremental Data Labeling Algorithm

LI Yan-hong, LI De-yu and WANG Su-ge   

  • Online: 2018-11-14 Published: 2018-11-14

Abstract: Data labeling is a simple and effective way to improve the efficiency of incremental data clustering. Labeling assigns each newly arriving data point to the cluster closest to it. One of the main difficulties in categorical data analysis, however, is the lack of an appropriate way to define the similarity between a data point and a cluster. To overcome this difficulty, this paper defines the representative of a cluster as a list of all attribute values, together with their frequencies, in each attribute domain of the cluster, and then defines a point-cluster dissimilarity measure based on the change in information entropy. Based on this dissimilarity measure, a categorical incremental data labeling algorithm is designed to allocate each unlabeled data point to the appropriate cluster. Comparative experiments on several public data sets and a text corpus show that the proposed algorithm achieves higher labeling accuracy, shorter execution time, and better scalability.

Key words: Clustering, Data labeling, Incremental data, Categorical data, Information entropy
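
The labeling scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the dissimilarity used here is an assumption: the increase in a cluster's size-weighted sum of per-attribute Shannon entropies when the point is added. The names `ClusterSummary`, `weighted_entropy`, `dissimilarity`, and `label` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


class ClusterSummary:
    """Cluster representative: value frequencies for every attribute (hypothetical helper)."""

    def __init__(self, points):
        self.size = 0
        self.counts = [Counter() for _ in points[0]]
        for point in points:
            self.add(point)

    def add(self, point):
        """Absorb one labeled point by updating the frequency lists."""
        self.size += 1
        for counter, value in zip(self.counts, point):
            counter[value] += 1


def attribute_entropy(counter, size):
    """Shannon entropy (in bits) of one attribute's value distribution."""
    return -sum((c / size) * log2(c / size) for c in counter.values() if c)


def weighted_entropy(counts, size):
    """Cluster size times the sum of its per-attribute entropies."""
    return size * sum(attribute_entropy(c, size) for c in counts)


def dissimilarity(point, cluster):
    """Assumed measure: growth in weighted entropy if `point` joined `cluster`."""
    before = weighted_entropy(cluster.counts, cluster.size)
    # Build the would-be frequency lists without modifying the cluster.
    after_counts = [c + Counter([v]) for c, v in zip(cluster.counts, point)]
    after = weighted_entropy(after_counts, cluster.size + 1)
    return after - before


def label(point, clusters):
    """Return the index of the cluster with the smallest dissimilarity."""
    return min(range(len(clusters)), key=lambda i: dissimilarity(point, clusters[i]))


clusters = [
    ClusterSummary([("red", "small"), ("red", "small"), ("red", "large")]),
    ClusterSummary([("blue", "large"), ("blue", "large"), ("green", "large")]),
]
new_point = ("red", "small")
chosen = label(new_point, clusters)
clusters[chosen].add(new_point)  # keep the representative current for later points
```

Because each cluster is summarized only by its frequency lists, labeling a new point in this sketch costs time proportional to the number of clusters and the number of distinct attribute values, independent of how many points were clustered earlier. Whether the paper's measure has the same cost profile is not stated in the abstract.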

