Computer Science ›› 2015, Vol. 42 ›› Issue (6): 223-227. doi: 10.11896/j.issn.1002-137X.2015.06.047

• Artificial Intelligence •

Categorical Incremental Data Labeling Algorithm

LI Yan-hong, LI De-yu and WANG Su-ge

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61272095, 61175067, 61303091, 61202365, 61100138, 61403238), the Natural Science Foundation of Shanxi Province (2012061015), the Science and Technology Research Project of Shanxi Province (20110321027-02), and the Research Project for Returned Overseas Scholars of Shanxi Province (2013-014).

Abstract: Data labeling is a simple yet effective way to improve the efficiency of clustering incremental data. It assigns each newly arriving data point to the cluster that is most similar to it. One of the main difficulties in categorical data analysis, however, is the lack of an appropriate way to define the similarity between a data point and a cluster. To overcome this difficulty, the representative of a cluster is defined as a list of the values of every attribute in the cluster together with their frequencies in the cluster, and the point-cluster dissimilarity is defined by the change in information entropy. Based on this dissimilarity measure, a categorical incremental data labeling algorithm is designed to assign each unlabeled data point to the appropriate cluster. Comparative experiments on several public data sets and a text corpus show that the proposed algorithm achieves high labeling accuracy and low execution time, and that it also scales well.

Key words: Clustering, Data labeling, Incremental data, Categorical data, Information entropy
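
The approach summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes that the point-cluster dissimilarity is the increase in a cluster's size-weighted entropy (cluster size times the sum of per-attribute Shannon entropies) when the point is added. The names `ClusterSummary`, `weighted_entropy`, `dissimilarity`, and `label_points`, as well as the toy data, are hypothetical.

```python
from collections import Counter
from math import log2


class ClusterSummary:
    """Hypothetical cluster representative: value counts for each attribute."""

    def __init__(self, n_attributes):
        self.size = 0
        self.counts = [Counter() for _ in range(n_attributes)]

    def add(self, point):
        """Absorb one categorical point (a tuple of attribute values)."""
        self.size += 1
        for counter, value in zip(self.counts, point):
            counter[value] += 1


def weighted_entropy(counts, size):
    """Return size * (sum over attributes of the Shannon entropy of value frequencies)."""
    if size == 0:
        return 0.0
    total = 0.0
    for counter in counts:
        total -= sum((c / size) * log2(c / size) for c in counter.values())
    return size * total


def dissimilarity(point, cluster):
    """Assumed measure: entropy increase if `point` were added to `cluster`."""
    before = weighted_entropy(cluster.counts, cluster.size)
    trial = [counter.copy() for counter in cluster.counts]
    for counter, value in zip(trial, point):
        counter[value] += 1
    after = weighted_entropy(trial, cluster.size + 1)
    return after - before


def label_points(points, clusters, update=True):
    """Assign each point to the cluster with the smallest dissimilarity."""
    labels = []
    for point in points:
        best = min(range(len(clusters)),
                   key=lambda k: dissimilarity(point, clusters[k]))
        labels.append(best)
        if update:  # keep the cluster summary current as data arrives
            clusters[best].add(point)
    return labels


# Toy data: two existing clusters over the attributes (color, shape).
clusters = [ClusterSummary(2), ClusterSummary(2)]
for p in [("red", "round"), ("red", "round"), ("red", "oval")]:
    clusters[0].add(p)
for p in [("blue", "square"), ("blue", "square"), ("green", "square")]:
    clusters[1].add(p)

labels = label_points([("red", "round"), ("blue", "square")], clusters)
print(labels)  # [0, 1]
```

In this sketch each cluster is summarized only by its value counts, so labeling a new point does not require revisiting the points already assigned to a cluster. Whether the paper's measure is weighted or normalized in the same way is an open assumption here.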

