计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 74-82.doi: 10.11896/jsjkx.190600158

• 大数据与数据科学 • 上一篇    下一篇

基于欧拉核的数据流聚类算法

朱颖雯1,2, 杨君1   

  1. (三江学院计算机科学与工程学院 南京210012)1;
    (南京航空航天大学计算机科学与技术学院 南京210016)2
  • 收稿日期:2019-05-26 出版日期:2019-12-15 发布日期:2019-12-17
  • 通讯作者: 朱颖雯(1982-),女,硕士,讲师,主要研究方向为机器学习,E-mail:83782796@qq.com。
  • 作者简介:杨君(1982-),女,博士,副教授,主要研究方向为云计算。
  • 基金资助:
    本文受江苏省三江学院校科研项目(2018SJKY026),江苏省普通高校自然科学研究资助项目(17KJD520007),江苏省高等学校自然科学研究面上项目(18KJB520042)资助。

Euler Kernel-based Data Stream Clustering Algorithm

ZHU Ying-wen1,2, YANG Jun1   

  1. (School of Computer Science and Engineering,Sanjiang University,Nanjing 210012,China)1;
    (College of Computer Science and Technology,Nanjing University of Aeronautics and Astronautics,Nanjing 210016,China)2
  • Received:2019-05-26 Online:2019-12-15 Published:2019-12-17

摘要: 随着云计算、物联网的快速发展,数据采集变得更加快捷和自动化。许多新型的应用领域中,诸如实时监控系统、车辆交通监控系统、电力消耗记录以及网络流量监控等,每时每刻都在产生大量的流数据,对数据流挖掘的研究成为了热点问题。聚类分析作为数据流挖掘领域的一个重要问题,在近期被高度重视并得到广泛研究。不同于传统的静态数据聚类问题,数据流聚类受到有限内存、一遍扫描、实时响应和概念漂移等许多约束。为此,文中基于欧拉核提出了一种针对数据流的聚类算法。首先通过欧拉核显式地将数据映射到相同维度的复数特征空间,然后在特征空间中基于GNG模型进行聚类。欧拉核依赖于非线性鲁棒的cosine度量,故对野值低敏感;显式的映射避免了一般的核聚类算法需要使用核技巧而无法处理数据流的问题。实验数据表明,基于欧拉核的数据流聚类算法不仅表现出了较好的聚类性能,还识别了数据的结构信息。

关键词: GNG, 核方法, 欧拉核, 数据流聚类

Abstract: With the advance of both cloud computing and internet of things,many applications generate huge amounts of data streams at fast speed.Examples include real-time surveillance systems,vehicle traffic monitoring systems,electricity consumption recording,and network traffic monitoring.Data stream mining has become a hot research topic.Its goal is to extract hidden knowledge/patterns from continuous data streams.Clustering,one of the most important problems in stream mining,has been highly explored.Different from traditional data clustering algorithm where given datasets are generally static and can be repeatedly read and processed,clustering data streams face more challenges due to having to satisfy such constraints as bounded memory,single-pass,real-time response and concept-drift detection.This paper pre-sented a new clustering algorithm for data streams,called EG-Stream,by combining the Euler kernel method with the Growing Neural Gas(GNG) model.It can not only maintain the benefit of nonlinear modeling using kernel function,but also significantly solve the large scale computational problem in kernel-based clustering.Euler kernel is relying on a nonlinear and robust cosine metric that is less sensitive to outliers.More important,it intrinsically induces an empirical map which maps data onto a complex space of the same dimension,and it takes these advantages to measure the similari-ty between data in a robust way without increasing the dimensionality of data,which avoids the problem that other kernel clustering algorithms can not deal with data streams.Although this method is embarrassingly simple just by incorporating the Euler kernel into GNG,the experimental results on variety of UCI datasets indicate that this method can still achieve comparable or even better performance than G-Stream algorithm,and identify the structural information from stream data.

Key words: Data stream clustering, Euler kernel, GNG, Kernel method

中图分类号: 

  • TP391
[1]NGUYEN H L,WOON Y K,NG W K.A survey on data stream clustering and classification[J].Knowledge and information systems,2015,45(3):535-569.
[2]GABER M M,ZASLAVSKY A,KRISHNASWAMY S.Mining data streams:a review[J].ACM Sigmod Record,2005,34(2):18-26.
[3]GAMA J,RODRIGUESP P.An overview on mining data streams[M]//Foundations of Computational,Intelligence Volume 6.Berlin:Springer,2009:29-45.
[4]AGGARWAL C C.Data streams:An overview and scientific applications[M]//Scientific Data Mining and Knowledge Discovery.Berlin:Springer,2009:377-397.
[5]SILVA J A,FARIA E R,BARROS R C,et al.Data stream clustering:A survey[J].ACM Computing Surveys (CSUR),2013,46(1):13.
[6]LI Y,YANG G,HE H,et al.A study of large-scale data clustering based on fuzzy clustering[J].Soft Computing,2016,20(8):3231-3242.
[7]ZHANG P,SHEN Q.Fuzzy c-means based coincidental link filtering in support of inferring social networks from spatiotemporal data streams[J].Soft Computing,2018,22(21):7015-7025.
[8]O’CALLAGHAN L,MISHRA N,MEYERSON A,et al. Streaming-data algorithms for high-quality clustering[C]//Proceedings 18th International Conference on Data Engineering.IEEE,2002:685-694.
[9]AGGARWAL C C,HAN J,WANG J,et al.A framework for clustering evolving data streams[C]//Proceedings of the 29th International Conference on Very Large Data Bases-Volume 29.VLDB Endowment,2003:81-92.
[10]AGGARWAL C C,HAN J,WANG J,et al.A framework for projected clustering of high dimensional data streams[C]//Proceedings of the Thirtieth International Conference on Very Large Data Bases-Volume 30.VLDB Endowment,2004:852-863.
[11]ZHOU A,CAO F,QIAN W,et al.Tracking clusters in evolving data streams over sliding windows[J].Knowledge and Information Systems,2008,15(2):181-214.
[12]UDOMMANETANAKIT K,RAKTHANMANON T,WAIYAMAI K.E-stream:Evolution-based technique for stream clustering[C]//International Conference on Advanced Data Mining and Applications.Berlin:Springer,2007:605-615.
[13]LÜHR S,LAZARESCU M.Incremental clustering of dynamic data streams using connectivity based representative points[J].Data & Knowledge Engineering,2009,68(1):1-27.
[14]CAO F,ESTERT M,QIAN W,et al.Density-based clustering over an evolving data stream with noise[C]//Proceedings of the 2006 SIAM International Conference on Data Mining.Society for Industrial and Applied Mathematics,2006:328-339.
[15]TASOULIS D K,ROSS G,ADAMS N M.Visualising the cluster structure of data streams[C]//International Symposium on Intelligent Data Analysis.Berlin:Springer,2007:81-92.
[16]KRIEGEL H P,KRÖGER P,NTOUTSI I,et al.Density based subspace clustering over dynamic data[C]//International Conference on Scientific and Statistical Database Management.Berlin:Springer,2011:387-404.
[17]CHEN Y,TU L.Density-based clustering for real-time stream data[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2007:133-142.
[18]WAN L,NG W K,DANG X H,et al.Density-based clustering of data streams at multiple resolutions[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2009,3(3):14.
[19]PARK N H,LEE W S.Statistical grid-based clustering over data streams[J].Acm Sigmod Record,2004,33(1):32-37.
[20]DANG X H,LEE V,NG W K,et al.An EM-based algorithm for clustering data streams in sliding windows[C]//International Conference on Database Systems for Advanced Applications.Berlin:Springer,2009:230-235.
[21]SMITH T,ALAHAKOON D.Growing self-organizing map for online continuous clustering[M]//Foundations of ComputationalIntelligence Volume 4.Berlin:Springer,2009:49-83.
[22]GHESMOUNE M,LEBBAH M,AZZAG H.A new growing neural gas for clustering data streams[J].Neural Networks,2016,78:36-50.
[23]SCHOLKOPF B,SMOLA A,MULLER K R.Nonlinear component analysis as a kernel eigenvalue problem[J].Neural computation,1998,10(5):1299-1319.
[24]MIKA S,RATSCH G,WESTON J,et al.Fish discriminant analysis with kernels[C]//Proceedings of IEEE Neural Networks for Signal Processing Workshop(NNSP).Madison,WI:IEEE Press,1999:41-48.
[25]ZHANG D Q.Kernel-Based Associative Memories,Clustering Algorithms and their Applications[D].Nanjing:Nanjing University of Aeronautics and Astronautics,2004.(in Chinese)
张道强.基于核的联想记忆及聚类算法的研究与应用[D].南京:南京航空航天大学,2004.
[26]ZHANG D Q,CHEN S C.Kernel-Based fuzzy and possibilities C-Means clustering[C]//Proceedings of the 13th International Conference on Artificial Neural Networks.Istanbul,Turkey:Springer,2003:122-125.
[27]GIROLAMI M.Mercer-based clustering in feature space[J]. IEEE Transactions on Neural Networks,2002,13(3):780-784.
[28]LIWICKI S,TZIMIROPOULOS G,ZAFEIRIOU S,et al.Euler principal component analysis[J].International Journal of Computer Vision,2013,101(3):498-518.
[29]WU J S,ZHENG W S,LAI J H.Euler Clustering[C]//Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.2013:1792-1798.
[30]CAO F,ESTERT M,QIAN W,et al.Density-based clustering over an evolving data stream with noise[C]//Proceedings of the 2006 SIAM International Conference on Data Mining.Society for Industrial and Applied Mathematics,2006:328-339.
[31]STREHL A,GHOSH J.Cluster ensembles-a knowledge reuse framework for combining multiple partitions[J].Journal of Machine Learning Research,2002,3:583-617.
[32]RAND W M.Objective criteria for the evaluation of clustering methods[J].Journal of the American Statistical Association,1971,66(336):846-850.
[1] 徐建民, 孙朋, 吴树芳.
传播路径树核学习的微博谣言检测方法
Microblog Rumor Detection Method Based on Propagation Path Tree Kernel Learning
计算机科学, 2022, 49(6): 342-349. https://doi.org/10.11896/jsjkx.210400096
[2] 王中元, 刘惊雷.
基于二阶近邻的核子空间聚类
Kernel Subspace Clustering Based on Second-order Neighbors
计算机科学, 2021, 48(6): 86-95. https://doi.org/10.11896/jsjkx.200800180
[3] 孙汉博,冯国灿.
基于改进的Porter Stemmer词干提取与核方法的垃圾邮件过滤算法
Spam Filter Algorithm with Improved Porter Stemmer and Kernels Methods
计算机科学, 2017, 44(Z6): 61-67. https://doi.org/10.11896/j.issn.1002-137X.2017.6A.012
[4] 龚劬,许凯强.
有监督的无参数核局部保持投影及人脸识别
Parameter-less Supervised Kernel Locality Preserving Projection and Face Recognition
计算机科学, 2016, 43(9): 301-304. https://doi.org/10.11896/j.issn.1002-137X.2016.09.060
[5] 郑建炜,孔晨辰,王万良,邱虹,章杭科.
有监督核化邻域投影分析算法
Kernel-based Supervised Neighborhood Projection Analysis Algorithm
计算机科学, 2016, 43(6): 312-315. https://doi.org/10.11896/j.issn.1002-137X.2016.06.062
[6] 王裴岩,蔡东风.
一种基于核距离的核函数度量方法
Distance-based Kernel Evaluation Measure
计算机科学, 2014, 41(2): 72-75.
[7] 曲武,王莉军,韩晓光.
云环境下基于LSH的分布式数据流聚类算法
Distributed Data Stream Clustering Based on LSH on Cloud Environments
计算机科学, 2014, 41(11): 195-202. https://doi.org/10.11896/j.issn.1002-137X.2014.11.039
[8] 王昕,刘颖,范九伦.
基于多核Fisher判别分析的人脸特征提取
Face Feature Extraction Based on Weighted Multiple Kernel Fisher Discriminant Analysis
计算机科学, 2012, 39(9): 262-265.
[9] 方益民,张玲,孙为民,徐保国.
求解非正定核Huber-SVR的SMO算法
SMO Algorithm for Resolving Huber-SVR with Non-positive Kernels
计算机科学, 2010, 37(7): 212-216.
[10] .
动态SVDD算法及其应用

计算机科学, 2009, 36(3): 156-157.
[11] .
基于核的自适应聚类及其在入侵检测中的应用

计算机科学, 2008, 35(12): 190-191.
[12] 贾磊 廖士中.
超核函数支持向量机

计算机科学, 2008, 35(12): 148-150.
[13] .
基于核的PP主成分分析及其在离群聚类中的应用

计算机科学, 2007, 34(9): 131-134.
[14] .
核函数的性质及其构造方法

计算机科学, 2006, 33(6): 172-174.
[15] 朱美琳 刘向东 陈世福.
核方法在人脸识别中的应用

计算机科学, 2003, 30(5): 82-84.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!