Computer Science ›› 2015, Vol. 42 ›› Issue (6): 223-227. doi: 10.11896/j.issn.1002-137X.2015.06.047
李艳红,李德玉,王素格
LI Yan-hong, LI De-yu and WANG Su-ge
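Abstract: Data labeling is a simple and effective way to improve the efficiency of clustering incremental data. It is the process of assigning each newly arriving data point to the cluster most similar to it. One difficulty in analyzing categorical data is the lack of an appropriate way to define the similarity between a data point and a cluster. To address this, the cluster representative is defined as a list of the attribute values of every attribute in the cluster together with their frequencies in the cluster, and the "point-cluster" dissimilarity is defined by the change in information entropy. Based on this dissimilarity measure, an incremental data labeling algorithm for categorical data is designed to assign unlabeled data to the appropriate cluster. Comparative experiments on public datasets and text corpora show that the algorithm achieves high labeling accuracy and low time cost, and that it scales well.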
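The idea can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the sketch assumes that the "point-cluster" dissimilarity is the increase in the sum of per-attribute Shannon entropies when the point is tentatively added to the cluster. The names `ClusterSummary`, `dissimilarity`, and `label` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(counts, n):
    """Shannon entropy (in bits) of one attribute, given value counts and cluster size n."""
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)


class ClusterSummary:
    """Cluster representative: per-attribute value frequencies (hypothetical name)."""

    def __init__(self, num_attrs):
        self.n = 0
        self.freq = [Counter() for _ in range(num_attrs)]

    def add(self, point):
        """Absorb a categorical point (a tuple of attribute values) into the summary."""
        self.n += 1
        for counter, value in zip(self.freq, point):
            counter[value] += 1

    def total_entropy(self):
        """Sum of per-attribute entropies; 0 for an empty cluster."""
        if self.n == 0:
            return 0.0
        return sum(entropy(c, self.n) for c in self.freq)

    def dissimilarity(self, point):
        """Assumed measure: entropy after tentatively adding the point minus entropy before."""
        before = self.total_entropy()
        after = 0.0
        for counter, value in zip(self.freq, point):
            bumped = counter.copy()  # leave the real summary untouched
            bumped[value] += 1
            after += entropy(bumped, self.n + 1)
        return after - before


def label(point, clusters):
    """Return the index of the cluster with the smallest dissimilarity to the point."""
    return min(range(len(clusters)), key=lambda i: clusters[i].dissimilarity(point))


# Usage: two small clusters over the attributes (color, shape).
a = ClusterSummary(2)
for p in [("red", "round"), ("red", "round")]:
    a.add(p)
b = ClusterSummary(2)
for p in [("blue", "square"), ("blue", "square")]:
    b.add(p)
print(label(("red", "round"), [a, b]))
```

Because only frequency lists are stored, labeling a point costs time proportional to the number of clusters and the size of the frequency lists, not to the number of points already clustered. Whether the summary is updated after each labeled point is not stated in the abstract; this sketch leaves it unchanged.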