Clustering Models and Algorithms for Distributed Data Streams Based on Data Synopsis

Computer Science, 2013, Vol. 40, Issue (6): 187-191.

MAO Guo-jun, CAO Yong-cun

School of Information, Central University of Finance and Economics, Beijing 100081; School of Information Engineering, Minzu University of China, Beijing 100081

Online: 2018-11-16 Published: 2018-11-16
Funding:
This work was supported by the National Natural Science Foundation of China (62173293) and the Teaching Reform Project Fund of Central University of Finance and Economics.

Abstract

Data stream mining addresses knowledge discovery from large volumes of streaming data and has been widely studied in recent years. A typical example of a data stream is the data collected by a sensor. As sensor networks become widespread, however, streaming data are often collected and managed in a distributed way, which creates the need to mine data streams in a distributed manner. The main obstacle to distributed data stream mining is the loss of mining quality or efficiency that distribution introduces. To support clustering of distributed data streams, this paper discusses a mining model for distributed data streams and, based on that model, designs a synopsis data structure for summarizing data streams together with algorithms for the key mining phases. The algorithms are evaluated theoretically or validated experimentally. Experimental results show that the proposed model and algorithms reduce data communication cost while maintaining high clustering quality for the global pattern mined from the distributed data streams.

Key words: Distributed data stream, Data synopsis, Incremental clustering, Global pattern
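
The abstract describes summarizing each local stream with a synopsis data structure, clustering incrementally, and combining the summaries into a global pattern so that less data has to be transmitted. This page does not give the paper's actual structure or algorithms, so the following is only a minimal sketch under stated assumptions: it swaps in a BIRCH-style cluster feature (count, linear sum, squared sum) as the synopsis and a simple distance threshold for assignment. The names `Synopsis`, `summarize_stream`, `merge_global`, and `threshold` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class Synopsis:
    """Hypothetical cluster summary: count, per-dimension linear sum, squared sum."""
    n: int
    ls: List[float]
    ss: float

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "Synopsis":
        return cls(1, list(x), sum(v * v for v in x))

    def add(self, x: Sequence[float]) -> None:
        """Absorb one point in O(d) time without storing the point itself."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other: "Synopsis") -> "Synopsis":
        """Summaries are additive, so two synopses combine without raw data."""
        return Synopsis(self.n + other.n,
                        [a + b for a, b in zip(self.ls, other.ls)],
                        self.ss + other.ss)

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]


def summarize_stream(points: Iterable[Sequence[float]],
                     threshold: float) -> List[Synopsis]:
    """Local site: one-pass incremental clustering of a stream (assumed scheme).

    A point joins the nearest synopsis if its centroid is within `threshold`;
    otherwise it starts a new synopsis.
    """
    synopses: List[Synopsis] = []
    for x in points:
        best = min(synopses, key=lambda s: math.dist(s.centroid(), x),
                   default=None)
        if best is not None and math.dist(best.centroid(), x) <= threshold:
            best.add(x)
        else:
            synopses.append(Synopsis.from_point(x))
    return synopses


def merge_global(sites: Iterable[List[Synopsis]],
                 threshold: float) -> List[Synopsis]:
    """Coordinator: merge local synopses whose centroids lie within `threshold`."""
    merged: List[Synopsis] = []
    for s in (s for site in sites for s in site):
        for i, g in enumerate(merged):
            if math.dist(g.centroid(), s.centroid()) <= threshold:
                merged[i] = g.merge(s)
                break
        else:
            merged.append(s)
    return merged
```

In this sketch each site sends only its list of synopses to the coordinator, never the raw points, which is one way the communication saving claimed in the abstract could arise. The global clusters depend on the order in which synopses arrive, a known limitation of threshold-based merging.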

MAO Guo-jun, CAO Yong-cun. Clustering Models and Algorithms for Distributed Data Streams Based on Data Synopsis[J]. Computer Science, 2013, 40(6): 187-191. https://doi.org/

