计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 74-82.doi: 10.11896/jsjkx.190600158
朱颖雯1,2, 杨君1
ZHU Ying-wen1,2, YANG Jun1
摘要: 随着云计算、物联网的快速发展,数据采集变得更加快捷和自动化。许多新型的应用领域中,诸如实时监控系统、车辆交通监控系统、电力消耗记录以及网络流量监控等,每时每刻都在产生大量的流数据,对数据流挖掘的研究成为了热点问题。聚类分析作为数据流挖掘领域的一个重要问题,在近期被高度重视并得到广泛研究。不同于传统的静态数据聚类问题,数据流聚类受到有限内存、一遍扫描、实时响应和概念漂移等许多约束。为此,文中基于欧拉核提出了一种针对数据流的聚类算法。首先通过欧拉核显式地将数据映射到相同维度的复数特征空间,然后在特征空间中基于GNG模型进行聚类。欧拉核依赖于非线性鲁棒的cosine度量,故对野值低敏感;显式的映射避免了一般的核聚类算法需要使用核技巧而无法处理数据流的问题。实验数据表明,基于欧拉核的数据流聚类算法不仅表现出了较好的聚类性能,还识别了数据的结构信息。
中图分类号:
[1]NGUYEN H L,WOON Y K,NG W K.A survey on data stream clustering and classification[J].Knowledge and information systems,2015,45(3):535-569.[2]GABER M M,ZASLAVSKY A,KRISHNASWAMY S.Mining data streams:a review[J].ACM Sigmod Record,2005,34(2):18-26.[3]GAMA J,RODRIGUESP P.An overview on mining data streams[M]//Foundations of Computational,Intelligence Volume 6.Berlin:Springer,2009:29-45.[4]AGGARWAL C C.Data streams:An overview and scientific applications[M]//Scientific Data Mining and Knowledge Discovery.Berlin:Springer,2009:377-397.[5]SILVA J A,FARIA E R,BARROS R C,et al.Data stream clustering:A survey[J].ACM Computing Surveys (CSUR),2013,46(1):13.[6]LI Y,YANG G,HE H,et al.A study of large-scale data clustering based on fuzzy clustering[J].Soft Computing,2016,20(8):3231-3242.[7]ZHANG P,SHEN Q.Fuzzy c-means based coincidental link filtering in support of inferring social networks from spatiotemporal data streams[J].Soft Computing,2018,22(21):7015-7025.[8]O’CALLAGHAN L,MISHRA N,MEYERSON A,et al. Streaming-data algorithms for high-quality clustering[C]//Proceedings 18th International Conference on Data Engineering.IEEE,2002:685-694.[9]AGGARWAL C C,HAN J,WANG J,et al.A framework for clustering evolving data streams[C]//Proceedings of the 29th International Conference on Very Large Data Bases-Volume 29.VLDB Endowment,2003:81-92.[10]AGGARWAL C C,HAN J,WANG J,et al.A framework for projected clustering of high dimensional data streams[C]//Proceedings of the Thirtieth International Conference on Very Large Data Bases-Volume 30.VLDB Endowment,2004:852-863.[11]ZHOU A,CAO F,QIAN W,et al.Tracking clusters in evolving data streams over sliding windows[J].Knowledge and Information Systems,2008,15(2):181-214.[12]UDOMMANETANAKIT K,RAKTHANMANON T,WAIYAMAI K.E-stream:Evolution-based technique for stream clustering[C]//International Conference on Advanced Data Mining and Applications.Berlin:Springer,2007:605-615.[13]LÜHR S,LAZARESCU M.Incremental clustering of dynamic data streams using connectivity based representative points[J].Data & Knowledge Engineering,2009,68(1):1-27.[14]CAO F,ESTERT M,QIAN W,et al.Density-based clustering over an evolving data stream with noise[C]//Proceedings of the 2006 SIAM International Conference on Data Mining.Society for Industrial and Applied Mathematics,2006:328-339.[15]TASOULIS D K,ROSS G,ADAMS N M.Visualising the cluster structure of data streams[C]//International Symposium on Intelligent Data Analysis.Berlin:Springer,2007:81-92.[16]KRIEGEL H P,KRÖGER P,NTOUTSI I,et al.Density based subspace clustering over dynamic data[C]//International Conference on Scientific and Statistical Database Management.Berlin:Springer,2011:387-404.[17]CHEN Y,TU L.Density-based clustering for real-time stream data[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2007:133-142.[18]WAN L,NG W K,DANG X H,et al.Density-based clustering of data streams at multiple resolutions[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2009,3(3):14.[19]PARK N H,LEE W S.Statistical grid-based clustering over data streams[J].Acm Sigmod Record,2004,33(1):32-37.[20]DANG X H,LEE V,NG W K,et al.An EM-based algorithm for clustering data streams in sliding windows[C]//International Conference on Database Systems for Advanced Applications.Berlin:Springer,2009:230-235.[21]SMITH T,ALAHAKOON D.Growing self-organizing map for online continuous clustering[M]//Foundations of ComputationalIntelligence Volume 4.Berlin:Springer,2009:49-83.[22]GHESMOUNE M,LEBBAH M,AZZAG H.A new growing neural gas for clustering data streams[J].Neural Networks,2016,78:36-50.[23]SCHOLKOPF B,SMOLA A,MULLER K R.Nonlinear component analysis as a kernel eigenvalue problem[J].Neural computation,1998,10(5):1299-1319.[24]MIKA S,RATSCH G,WESTON J,et al.Fish discriminant analysis with kernels[C]//Proceedings of IEEE Neural Networks for Signal Processing Workshop(NNSP).Madison,WI:IEEE Press,1999:41-48.[25]ZHANG D Q.Kernel-Based Associative Memories,Clustering Algorithms and their Applications[D].Nanjing:Nanjing University of Aeronautics and Astronautics,2004.(in Chinese) 张道强.基于核的联想记忆及聚类算法的研究与应用[D].南京:南京航空航天大学,2004.[26]ZHANG D Q,CHEN S C.Kernel-Based fuzzy and possibilities C-Means clustering[C]//Proceedings of the 13th International Conference on Artificial Neural Networks.Istanbul,Turkey:Springer,2003:122-125.[27]GIROLAMI M.Mercer-based clustering in feature space[J]. IEEE Transactions on Neural Networks,2002,13(3):780-784.[28]LIWICKI S,TZIMIROPOULOS G,ZAFEIRIOU S,et al.Euler principal component analysis[J].International Journal of Computer Vision,2013,101(3):498-518.[29]WU J S,ZHENG W S,LAI J H.Euler Clustering[C]//Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.2013:1792-1798.[30]CAO F,ESTERT M,QIAN W,et al.Density-based clustering over an evolving data stream with noise[C]//Proceedings of the 2006 SIAM International Conference on Data Mining.Society for Industrial and Applied Mathematics,2006:328-339.[31]STREHL A,GHOSH J.Cluster ensembles-a knowledge reuse framework for combining multiple partitions[J].Journal of Machine Learning Research,2002,3:583-617.[32]RAND W M.Objective criteria for the evaluation of clustering methods[J].Journal of the American Statistical Association,1971,66(336):846-850. |
[1] | 徐建民, 孙朋, 吴树芳. 传播路径树核学习的微博谣言检测方法 Microblog Rumor Detection Method Based on Propagation Path Tree Kernel Learning 计算机科学, 2022, 49(6): 342-349. https://doi.org/10.11896/jsjkx.210400096 |
[2] | 王中元, 刘惊雷. 基于二阶近邻的核子空间聚类 Kernel Subspace Clustering Based on Second-order Neighbors 计算机科学, 2021, 48(6): 86-95. https://doi.org/10.11896/jsjkx.200800180 |
[3] | 孙汉博,冯国灿. 基于改进的Porter Stemmer词干提取与核方法的垃圾邮件过滤算法 Spam Filter Algorithm with Improved Porter Stemmer and Kernels Methods 计算机科学, 2017, 44(Z6): 61-67. https://doi.org/10.11896/j.issn.1002-137X.2017.6A.012 |
[4] | 龚劬,许凯强. 有监督的无参数核局部保持投影及人脸识别 Parameter-less Supervised Kernel Locality Preserving Projection and Face Recognition 计算机科学, 2016, 43(9): 301-304. https://doi.org/10.11896/j.issn.1002-137X.2016.09.060 |
[5] | 郑建炜,孔晨辰,王万良,邱虹,章杭科. 有监督核化邻域投影分析算法 Kernel-based Supervised Neighborhood Projection Analysis Algorithm 计算机科学, 2016, 43(6): 312-315. https://doi.org/10.11896/j.issn.1002-137X.2016.06.062 |
[6] | 王裴岩,蔡东风. 一种基于核距离的核函数度量方法 Distance-based Kernel Evaluation Measure 计算机科学, 2014, 41(2): 72-75. |
[7] | 曲武,王莉军,韩晓光. 云环境下基于LSH的分布式数据流聚类算法 Distributed Data Stream Clustering Based on LSH on Cloud Environments 计算机科学, 2014, 41(11): 195-202. https://doi.org/10.11896/j.issn.1002-137X.2014.11.039 |
[8] | 王昕,刘颖,范九伦. 基于多核Fisher判别分析的人脸特征提取 Face Feature Extraction Based on Weighted Multiple Kernel Fisher Discriminant Analysis 计算机科学, 2012, 39(9): 262-265. |
[9] | 方益民,张玲,孙为民,徐保国. 求解非正定核Huber-SVR的SMO算法 SMO Algorithm for Resolving Huber-SVR with Non-positive Kernels 计算机科学, 2010, 37(7): 212-216. |
[10] | . 动态SVDD算法及其应用 计算机科学, 2009, 36(3): 156-157. |
[11] | . 基于核的自适应聚类及其在入侵检测中的应用 计算机科学, 2008, 35(12): 190-191. |
[12] | 贾磊 廖士中. 超核函数支持向量机 计算机科学, 2008, 35(12): 148-150. |
[13] | . 基于核的PP主成分分析及其在离群聚类中的应用 计算机科学, 2007, 34(9): 131-134. |
[14] | . 核函数的性质及其构造方法 计算机科学, 2006, 33(6): 172-174. |
[15] | 朱美琳 刘向东 陈世福. 核方法在人脸识别中的应用 计算机科学, 2003, 30(5): 82-84. |
|