计算机科学 ›› 2014, Vol. 41 ›› Issue (11): 195-202.doi: 10.11896/j.issn.1002-137X.2014.11.039

• 信息安全 • 上一篇    下一篇

云环境下基于LSH的分布式数据流聚类算法

曲武,王莉军,韩晓光   

  1. 清华大学计算机科学与技术系 北京100084;北京启明星辰信息安全技术有限公司核心研究院 北京100193;中国科学技术信息研究所 北京100038;北京科技大学计算机与通信工程学院 北京100083
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家“九七三”重点基础研究发展规划项目基金(2007CB310803),国家自然科学基金重点项目(61035004),国家自然科学基金(60875029),国家科技部博士后基金(2013M541005)资助

Distributed Data Stream Clustering Based on LSH on Cloud Environments

QU Wu,WANG Li-jun and HAN Xiao-guang   

  • Online:2018-11-14 Published:2018-11-14

摘要: 近年来,随着计算机技术、信息处理技术在工业生产、信息处理等领域的广泛应用,会连续不断地产生大量随时间演变的序列型数据,构成时间序列数据流,如互联网新闻语料分析、网络入侵检测、股市行情分析和传感器网络数据分析等。实时数据流聚类分析是当前数据流挖掘研究的热点问题。单遍扫描算法虽然满足数据流高速、数据规模较大和实时分析的需求, 但因缺乏有效的聚类算法来识别和区分模式而限制了其有效性和可扩展性。为了解决以上问题,提出云环境下基于LSH的分布式数据流聚类算法DLCStream,通过引入Map-Reduce框架和位置敏感哈希机制,DLCStream算法能够快速找到数据流中的聚类模式。通过详细的理论分析和实验验证表明,与传统的数据流聚类框架CluStream算法相比,DLCStream算法在高效并行处理、可扩展性和聚类结果质量方面更有优势。

关键词: 数据流聚类,位置敏感哈希方法,Map-Reduce框架,DLCStream算法

Abstract: In recent years,with the wide application of computer technology and internet technology in the field of industrial production and information processing,these applications will continuously produce large amounts of sequence data evolved over time and constitute time series data stream,such as internet news feed analysis,network intrusion detection system,stock markets analysis and sensor networks data analysis.The real-time clustering analysis of data stream is a hot issue of the current data stream mining.However,due to the high speed,large-scale data and real-time analysis,data must often be analyzed on the fly.Although the one-pass-through scanning algorithm is able to meet the needs,the lack of efficient clustering algorithms to identify and distinguish patterns limits the effectivity and scalability of this method.In order to solve the above problems,we proposed a novel stream clustering algorithm called DLCStream,which is based on LSH on cloud environments.It is a distributed data stream clustering approach that uses the Map-Reduce framework and LSH mechanism to quickly find the clustering pattern in the data stream.Finally,the theoretical analysis and experiment results illustrate that the DLCStream algorithm results is significantly more efficient in efficient parallel processing,scalablity,and quality of the clustering results compared with traditional data stream clustering framework CluStream algorithm.

Key words: Data stream clustering,Locality sensitive hashing,Map-Reduce frame,DLCStream

[1] Guerrieri A,Montresor A.DS-Means:Distributed data stream clustering [C]∥Euro-Par’12 Proceedings of the 18th International Conference on Parallel Processing.2012:260-271
[2] Aggarwal C,Han J,Wang J,et al.A Framework for Clustering Evolving Data Streams [C]∥Proceedings of the 29th International Conference on Very Large Databases.2003,29:81-92
[3] Aggarwal C,Han J,Wang J,et al.A Framework for Projected Clustering of High Dimensional Data Streams [C]∥Proceedings of the international conference on Very Large Databases.2004:852-863
[4] Shahnewaz S M,Rahman M A,Mahmud H.A Self Acting Initial Seed Selection Algorithm for K-means Clustering Based on Convex-Hull [J].Informatics Engineering and Information Science,Communications in Computer and Information Science,2011,252(5):641-650
[5] Krstinic′ D,Skelin A K,Slapnicar I.Fast two-step histogram-based image segmentation [J].Image Processing,IET,2011,5(1):63-72
[6] Gao Song,Zhang Cheng-cui,Chen Wei-bang.Identifying image spam authorship with variable bin-width histogram-based projective clustering [C]∥2011 IEEE International Conference on Multimedia and Expo (ICME).2011:1-6
[7] Zhou A,Cao F,Yan Y,et al.He Distributed Data Stream Clustering:A Fast EM-based Approach [C]∥IEEE 23rd International Conference on Data Engineering.2007:736-745
[8] Chen Y,Tu L.Density-based clustering for real-time stream data [C]∥Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2007:133-142
[9] Kriegel H-P,Krger P,Sander J,et al.Density-based clustering [J].Wiley Interdisciplinary Reviews:Data Mining and Know-ledge Discovery,2011,1(3):231-240
[10] Mao Guo-jun,Yang Yi.A Micro-cluster Based Ensemble A-pproach for Classifying Distributed Data Streams [C]∥2011 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI).2011,753-759
[11] Xie Cong-hua,Song Yu-qing,Chen Jian-mei.Fast medical image mixture density clustering segmentation using stratification sampling and kernel density estimation [J].Signal,Image and Video P rocessing,2011,5(2):257-267
[12] Cai Qiang,Rushton G,Bhaduri B.Validation tests of an improved kernel density estimation method for identifying disease clusters [J].Journal of Geographical Systems,2012,14(3):243-264
[13] Subramaniam S,Palpanas T,Papadopoulos D,et al.Online Outlier Detection in Sensor Data Using Non-Parametric Models [C]∥Proceedings of the 32nd International Conference on Very Large Databases.2006:187-198
[14] He Hai-bo,Cao Yuan.Kernel density estimation with streamdata based on self-organizing map [C]∥2011 IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS).2011:24-30
[15] Heinz C,Seeger B.Wavelet Density Estimators over DataStreams [C]∥Proceedings of the 2005 ACM Symposium on Applied Computing.2005:578-579
[16] Tang Cheng-long,Wang Shi-gang,Xu Wei.Improved FuzzyClustering Algorithm Based on Data Weighted Approach [J].2010,32(6):1277-1283
[17] Henzinger M R,Raghavan P,Rajagopalan S.Computing on data streams,External memory algorithms [M]∥American Mathematical Society.Boston,MA,1999
[18] Golab L,zsu M T.Issues in data stream management [J].SIGMOD Records,2003,32(2):5-14
[19] Guha S,Mishra N,Motwani R,et al.Clustering data streams[C]∥Annual IEEE Symposium on Foundations of Computational Science.2000:359-366
[20] O’Callaghan L,Mishra N,Meyerson A,et al.Streaming-data algorithms for high-quality clustering [C]∥Proc.of 18th International Conference on Data Engineering.2002:685-694
[21] Bandyopadhyay S,Gianella C,Maulik U,et al.Clustering Distributed Data Streams in Peer-to-Peer Environments [J].Information Science Journal,2006,176(14):1952-1985
[22] Beringer J,Hullermeier E.Online Clustering of Parallel DataStreams [J].Data and Knowledge Engineering,2006,58(2):180-204
[23] Indyk P,Motwani R.Approximate nearest neighbors:towardsremoving the curse of dimensionality[C]∥Proceedings of the thirtieth annual ACM symposium on Theory of computing(STOC’ 98).New York,NY,USA:ACM,1998:604-613
[24] Datar M,Immorlica N,Indyk P,et al.Locality-sensitive hashing scheme based on p-stable distributions[C]∥Proceedings of the twentieth annual symposium on Computational geometry(SCG’ 04) .New York,NY,USA:ACM,2004:253-262
[25] Gionis A,Indyk P,Motwani R.Similarity Search in High Di-mensions via Hashing[C]∥Proceedings of the 25th International Conference on Very Large Data Bases(VLDB’ 99).San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,1999:518-529
[26] Zolotarev V M.One-dimensional stable distributions [M]∥Translations of Mathematical Monographs,American Mathematical Society,1986
[27] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters [J].Communication of the ACM,2008,51(1):107-113
[28] Kaufman L,Rousseeuw P J.Finding Groups in Data:an Introduction to Cluster Analysis [M].John Wiley & Sons,1990
[29] Sun Xin,Jiao Yu.pGrid:Parallel Grid-Based Data Stream Clustering with MapReduce [R].Reports,OAK Ridge National Laboratory Oak Ridge,Applied Software Engineering Research Group,2009
[30] MIT Lincoln Lab.DARPA Intrusion Detection Data Sets.[2009-02].http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/index.html

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!