Computer Science, 2016, Vol. 43, Issue (5): 223-229. doi: 10.11896/j.issn.1002-137X.2016.05.041

• Artificial Intelligence •

Research of Text Data Streams Clustering Algorithm Based on Affinity Propagation

LI Yi-ming, NI Li-ping, FANG Qing-hua and LIU Hui-ting

  1. LI Yi-ming, NI Li-ping, FANG Qing-hua: School of Management, Hefei University of Technology, Hefei 230009; Key Laboratory of Process Optimization and Intelligent Decision-making, Ministry of Education, Hefei University of Technology, Hefei 230009
  2. LIU Hui-ting: School of Computer Science and Technology, Anhui University, Hefei 230601
  • Online: 2018-12-01  Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (71301041, 61202227, 71271071) and the Key Program of the National Natural Science Foundation of China (71490725).

Abstract: With the advent of the big data era, large volumes of unstructured text data streams are being generated online. These streams are dynamic, high-dimensional and sparse. To address these characteristics, this paper combines the traditional affinity propagation (AP) algorithm with the features of streaming text data and proposes a text data stream clustering algorithm called OAP-s. OAP-s introduces a decay factor into AP to attenuate the cluster-center results, and carries the cluster centers of the current time window into the next window for clustering. To address the shortcomings of OAP-s, a second algorithm, OWAP-s, is proposed. Building on the OAP-s model, OWAP-s defines a weighted similarity and introduces an attraction factor that makes historical cluster centers more attractive, which yields more accurate clustering results. Both algorithms adopt a sliding time window model, so they capture both the temporal characteristics and the distribution of the data stream. Experimental results show that both algorithms achieve higher clustering accuracy and stability than the OSKM algorithm, and that they scale well and are extensible.

Key words: Data mining, AP clustering, Text data, Sliding time window, Weight
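The window-to-window mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' OAP-s or OWAP-s implementation: the abstract gives no formulas, so the weight update, the way the attraction factor is added to the similarity, and the pruning threshold are all assumptions made for illustration. The names `cosine`, `affinity_propagation`, `cluster_stream`, `decay`, `attract` and `min_weight` are hypothetical. Only the Python standard library is used, and documents are assumed to be sparse term-to-weight dictionaries.

```python
from statistics import median

def cosine(a, b):
    """Cosine similarity between two sparse term->weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def affinity_propagation(S, damping=0.7, iters=100):
    """Plain AP message passing; S[k][k] is the preference of point k.
    Returns, for every point, the index of its exemplar.
    Unoptimised (cubic per iteration), meant for small windows only."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iters):
        for i in range(n):
            for k in range(n):
                # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
                rival = max((A[i][j] + S[i][j] for j in range(n) if j != k),
                            default=0.0)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        for k in range(n):
            pos = [max(0.0, R[j][k]) for j in range(n)]
            total = sum(pos) - pos[k]  # sum over j != k of max(0, r(j,k))
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    score = [R[k][k] + A[k][k] for k in range(n)]
    # Fall back to the single best candidate if no point qualifies.
    exemplars = ([k for k in range(n) if score[k] > 0]
                 or [max(range(n), key=score.__getitem__)])
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]

def cluster_stream(windows, decay=0.5, attract=0.1, min_weight=0.2):
    """Cluster each window together with exemplars carried from the last one.
    Returns one (labels_of_new_docs, exemplar_docs) pair per window."""
    carried, history = [], []  # carried: (doc, weight) pairs
    for docs in windows:
        n_old = len(carried)
        points = [d for d, _ in carried] + list(docs)
        weights = [w for _, w in carried] + [0.0] * len(docs)
        n = len(points)
        if n == 0:
            history.append(([], []))
            continue
        S = [[cosine(points[i], points[k]) - 1.0 for k in range(n)]
             for i in range(n)]
        off_diag = [S[i][k] for i in range(n) for k in range(n) if i != k]
        pref = median(off_diag) if off_diag else -1.0
        for k in range(n):
            S[k][k] = pref
            for i in range(n):
                # Assumed weighting: historical exemplars attract more strongly.
                S[i][k] += attract * weights[k]
        labels = affinity_propagation(S)
        ids = sorted(set(labels))
        history.append(([ids.index(e) for e in labels[n_old:]],
                        [points[e] for e in ids]))
        carried = []
        for e in ids:
            # Assumed update: decayed old weight plus share of this window.
            w = decay * weights[e] + labels.count(e) / n
            if w >= min_weight:
                carried.append((points[e], w))
    return history

windows = [
    [{"stock": 1.0, "market": 1.0}, {"stock": 1.0, "bank": 1.0},
     {"match": 1.0, "goal": 1.0}, {"match": 1.0, "team": 1.0}],
    [{"stock": 1.0, "market": 1.0, "bank": 1.0},
     {"goal": 1.0, "team": 1.0, "match": 1.0}],
]
for labels, exemplars in cluster_stream(windows):
    print(labels, len(exemplars))
```

Setting `attract=0` leaves only the decayed carry-over of exemplars, which is closer in spirit to the OAP-s description; a positive `attract` adds the bias toward historical cluster centers that the abstract attributes to OWAP-s.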

