Computer Science ›› 2014, Vol. 41 ›› Issue (11): 195-202.doi: 10.11896/j.issn.1002-137X.2014.11.039

Previous Articles     Next Articles

Distributed Data Stream Clustering Based on LSH on Cloud Environments

QU Wu,WANG Li-jun and HAN Xiao-guang   

  • Online:2018-11-14 Published:2018-11-14

Abstract: In recent years,with the wide application of computer technology and internet technology in the field of industrial production and information processing,these applications will continuously produce large amounts of sequence data evolved over time and constitute time series data stream,such as internet news feed analysis,network intrusion detection system,stock markets analysis and sensor networks data analysis.The real-time clustering analysis of data stream is a hot issue of the current data stream mining.However,due to the high speed,large-scale data and real-time analysis,data must often be analyzed on the fly.Although the one-pass-through scanning algorithm is able to meet the needs,the lack of efficient clustering algorithms to identify and distinguish patterns limits the effectivity and scalability of this method.In order to solve the above problems,we proposed a novel stream clustering algorithm called DLCStream,which is based on LSH on cloud environments.It is a distributed data stream clustering approach that uses the Map-Reduce framework and LSH mechanism to quickly find the clustering pattern in the data stream.Finally,the theoretical analysis and experiment results illustrate that the DLCStream algorithm results is significantly more efficient in efficient parallel processing,scalablity,and quality of the clustering results compared with traditional data stream clustering framework CluStream algorithm.

Key words: Data stream clustering,Locality sensitive hashing,Map-Reduce frame,DLCStream

[1] Guerrieri A,Montresor A.DS-Means:Distributed data stream clustering [C]∥Euro-Par’12 Proceedings of the 18th International Conference on Parallel Processing.2012:260-271
[2] Aggarwal C,Han J,Wang J,et al.A Framework for Clustering Evolving Data Streams [C]∥Proceedings of the 29th International Conference on Very Large Databases.2003,29:81-92
[3] Aggarwal C,Han J,Wang J,et al.A Framework for Projected Clustering of High Dimensional Data Streams [C]∥Proceedings of the international conference on Very Large Databases.2004:852-863
[4] Shahnewaz S M,Rahman M A,Mahmud H.A Self Acting Initial Seed Selection Algorithm for K-means Clustering Based on Convex-Hull [J].Informatics Engineering and Information Science,Communications in Computer and Information Science,2011,252(5):641-650
[5] Krstinic′ D,Skelin A K,Slapnicar I.Fast two-step histogram-based image segmentation [J].Image Processing,IET,2011,5(1):63-72
[6] Gao Song,Zhang Cheng-cui,Chen Wei-bang.Identifying image spam authorship with variable bin-width histogram-based projective clustering [C]∥2011 IEEE International Conference on Multimedia and Expo (ICME).2011:1-6
[7] Zhou A,Cao F,Yan Y,et al.He Distributed Data Stream Clustering:A Fast EM-based Approach [C]∥IEEE 23rd International Conference on Data Engineering.2007:736-745
[8] Chen Y,Tu L.Density-based clustering for real-time stream data [C]∥Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2007:133-142
[9] Kriegel H-P,Krger P,Sander J,et al.Density-based clustering [J].Wiley Interdisciplinary Reviews:Data Mining and Know-ledge Discovery,2011,1(3):231-240
[10] Mao Guo-jun,Yang Yi.A Micro-cluster Based Ensemble A-pproach for Classifying Distributed Data Streams [C]∥2011 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI).2011,753-759
[11] Xie Cong-hua,Song Yu-qing,Chen Jian-mei.Fast medical image mixture density clustering segmentation using stratification sampling and kernel density estimation [J].Signal,Image and Video P rocessing,2011,5(2):257-267
[12] Cai Qiang,Rushton G,Bhaduri B.Validation tests of an improved kernel density estimation method for identifying disease clusters [J].Journal of Geographical Systems,2012,14(3):243-264
[13] Subramaniam S,Palpanas T,Papadopoulos D,et al.Online Outlier Detection in Sensor Data Using Non-Parametric Models [C]∥Proceedings of the 32nd International Conference on Very Large Databases.2006:187-198
[14] He Hai-bo,Cao Yuan.Kernel density estimation with streamdata based on self-organizing map [C]∥2011 IEEE Workshop on Evolving and Adaptive Intelligent Systems (EAIS).2011:24-30
[15] Heinz C,Seeger B.Wavelet Density Estimators over DataStreams [C]∥Proceedings of the 2005 ACM Symposium on Applied Computing.2005:578-579
[16] Tang Cheng-long,Wang Shi-gang,Xu Wei.Improved FuzzyClustering Algorithm Based on Data Weighted Approach [J].2010,32(6):1277-1283
[17] Henzinger M R,Raghavan P,Rajagopalan S.Computing on data streams,External memory algorithms [M]∥American Mathematical Society.Boston,MA,1999
[18] Golab L,zsu M T.Issues in data stream management [J].SIGMOD Records,2003,32(2):5-14
[19] Guha S,Mishra N,Motwani R,et al.Clustering data streams[C]∥Annual IEEE Symposium on Foundations of Computational Science.2000:359-366
[20] O’Callaghan L,Mishra N,Meyerson A,et al.Streaming-data algorithms for high-quality clustering [C]∥Proc.of 18th International Conference on Data Engineering.2002:685-694
[21] Bandyopadhyay S,Gianella C,Maulik U,et al.Clustering Distributed Data Streams in Peer-to-Peer Environments [J].Information Science Journal,2006,176(14):1952-1985
[22] Beringer J,Hullermeier E.Online Clustering of Parallel DataStreams [J].Data and Knowledge Engineering,2006,58(2):180-204
[23] Indyk P,Motwani R.Approximate nearest neighbors:towardsremoving the curse of dimensionality[C]∥Proceedings of the thirtieth annual ACM symposium on Theory of computing(STOC’ 98).New York,NY,USA:ACM,1998:604-613
[24] Datar M,Immorlica N,Indyk P,et al.Locality-sensitive hashing scheme based on p-stable distributions[C]∥Proceedings of the twentieth annual symposium on Computational geometry(SCG’ 04) .New York,NY,USA:ACM,2004:253-262
[25] Gionis A,Indyk P,Motwani R.Similarity Search in High Di-mensions via Hashing[C]∥Proceedings of the 25th International Conference on Very Large Data Bases(VLDB’ 99).San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,1999:518-529
[26] Zolotarev V M.One-dimensional stable distributions [M]∥Translations of Mathematical Monographs,American Mathematical Society,1986
[27] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters [J].Communication of the ACM,2008,51(1):107-113
[28] Kaufman L,Rousseeuw P J.Finding Groups in Data:an Introduction to Cluster Analysis [M].John Wiley & Sons,1990
[29] Sun Xin,Jiao Yu.pGrid:Parallel Grid-Based Data Stream Clustering with MapReduce [R].Reports,OAK Ridge National Laboratory Oak Ridge,Applied Software Engineering Research Group,2009
[30] MIT Lincoln Lab.DARPA Intrusion Detection Data Sets.[2009-02].http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/index.html

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!