Computer Science (计算机科学), 2022, Vol. 49, Issue 7: 25-30. doi: 10.11896/jsjkx.210600155

• Database & Big Data & Data Science

Concept Drift Detection Method for Multidimensional Data Stream Based on Clustering Partition

CHEN Yuan-yuan, WANG Zhi-hai

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
    Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2021-06-19 Revised: 2021-12-07 Online: 2022-07-15 Published: 2022-07-12
  • Corresponding author: WANG Zhi-hai (zhhwang@bjtu.edu.cn)
  • Author e-mail: 19120340@bjtu.edu.cn
  • About author: CHEN Yuan-yuan, born in 1997, master. Her main research interests include data stream mining and unsupervised learning.
    WANG Zhi-hai, born in 1963, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include data mining and business intelligence, machine learning and computational intelligence.
  • Supported by:
    National Natural Science Foundation of China (61771058).

Abstract: Analyzing and exploiting the latent information in data streams is a central task of data stream mining. However, the data distribution may change over time and invalidate the learned hypothesis. This phenomenon, known as concept drift, poses a major challenge for data stream mining. Detecting changes in the data distribution is a direct and effective way to detect concept drift. Existing methods build histograms on a tree structure or a grid to describe the data distribution. However, tree-based methods are prone to blind spots in the test and are hard to interpret, while grid-based methods consume too much memory on multi-dimensional data. To address these problems, a concept drift detection method for multi-dimensional data streams, partition based on uniform density clusters (PUDC), is proposed. The method uses an improved k-Means algorithm to partition the data into regions of uniform density and applies the chi-square test to the counts in each partition to detect changes in the data distribution, and hence concept drift. To verify its effectiveness, experiments were conducted on four artificial datasets and three real datasets, and the type I and type II error rates were compared across data of different dimensions. The results show that PUDC has some advantages over several recent algorithms in concept drift detection on multi-dimensional data streams.

Key words: k-Means, Concept drift detection, Data stream mining, Histogram, Hypothesis testing
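The general idea in the abstract (partition the space with clusters, count the points per partition in two windows, and compare the counts with a chi-square test) can be illustrated with a minimal sketch. This is not the PUDC algorithm itself: it assumes plain Lloyd's k-means in place of the improved equal-density variant, and it approximates the chi-square critical value with the Wilson-Hilferty formula. All function names, the parameter `k=5`, and the synthetic Gaussian data are hypothetical choices made for illustration only.

```python
import math
import random
from statistics import NormalDist


def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumption; the paper uses an improved variant)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def partition_counts(points, centroids):
    """Histogram of points over the partitions induced by the centroids."""
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest(p, centroids)] += 1
    return counts


def chi_square_statistic(ref_counts, new_counts):
    """Pearson statistic for a 2 x k table of non-empty windows; returns (stat, dof)."""
    n_ref, n_new = sum(ref_counts), sum(new_counts)
    total = n_ref + n_new
    stat, used = 0.0, 0
    for r, t in zip(ref_counts, new_counts):
        if r + t == 0:  # skip partitions that are empty in both windows
            continue
        used += 1
        for obs, n in ((r, n_ref), (t, n_new)):
            expected = (r + t) * n / total
            stat += (obs - expected) ** 2 / expected
    return stat, max(used - 1, 1)


def critical_value(dof, alpha=0.05):
    """Approximate chi-square critical value (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(1 - alpha)
    c = 2 / (9 * dof)
    return dof * (1 - c + z * math.sqrt(c)) ** 3


def detect_drift(reference, window, k=5, alpha=0.05):
    """Return (drift_flag, statistic) for a new window against a reference window."""
    centroids = kmeans(reference, k)
    stat, dof = chi_square_statistic(partition_counts(reference, centroids),
                                     partition_counts(window, centroids))
    return stat > critical_value(dof, alpha), stat


# Synthetic 2-D example: the second window has its mean shifted from 0 to 5.
rng = random.Random(1)
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
shifted = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(500)]
drifted, statistic = detect_drift(reference, shifted)
```

In this sketch the partitions are learned once from the reference window and reused for every incoming window, so only k counters per window need to be stored, regardless of the data dimension.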

CLC Number:

  • TP391