Computer Science (计算机科学), 2022, Vol. 49, Issue (3): 92-98. doi: 10.11896/jsjkx.210200047
Topic: Big Data & Data Science (virtual special topic)
XIA Yuan (夏源)1, ZHAO Yun-long (赵蕴龙)1,2, FAN Qi-lin (范其林)1
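Abstract: Dynamic data streams are non-stationary and subject to concept drift, so an ensemble classification model must be able to adapt promptly to a new environment. Existing methods usually rely on supervised information to update the weights of the base classifiers, giving higher weights to the base classifiers that fit the current environment. In a real data stream, however, supervised information is not available immediately. To address this problem, this paper proposes a data stream ensemble classification algorithm that updates base-classifier weights based on information entropy. First, each base classifier is initialized on a random feature subspace to build the ensemble. Second, a new base classifier is built from each newly arrived data chunk and replaces the base classifier with the lowest weight in the ensemble. Third, an entropy-based weight update strategy updates the weights of the base classifiers in real time. Finally, the base classifiers that meet the requirement take part in weighted voting to produce the classification result. The proposed algorithm is compared with several classical learning algorithms. The experimental results show that it has a clear advantage in classification accuracy and is suitable for multiple types of concept drift.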
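The entropy-based weighting and voting steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' formula: the abstract does not give the weight definition, so the sketch assumes that a base classifier's weight is one minus the normalized Shannon entropy of its predicted class distribution, and that "meeting the requirement" means exceeding a threshold. The names `entropy_weight`, `index_of_weakest`, `weighted_vote`, and `min_weight` are hypothetical.

```python
import math
from collections import defaultdict


def normalized_entropy(probs):
    """Shannon entropy of a class-probability vector, scaled to [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)


def entropy_weight(probs):
    # Assumption (not from the paper): a confident, low-entropy prediction
    # earns a higher weight; no true label is needed to compute it.
    return 1.0 - normalized_entropy(probs)


def index_of_weakest(weights):
    """Position of the lowest-weighted member, i.e. the one to replace."""
    return min(range(len(weights)), key=weights.__getitem__)


def weighted_vote(member_probs, min_weight=0.0):
    """Weighted vote over members whose weight exceeds min_weight.

    member_probs holds one class-probability vector per base classifier.
    Returns the winning class index, or None if no member qualifies.
    """
    scores = defaultdict(float)
    for probs in member_probs:
        w = entropy_weight(probs)
        if w <= min_weight:  # hypothetical eligibility rule
            continue
        label = max(range(len(probs)), key=probs.__getitem__)
        scores[label] += w
    if not scores:
        return None
    return max(scores, key=scores.get)
```

Because the weight is computed from the predicted distribution alone, it can be refreshed for every incoming instance without waiting for labels, which is the motivation the abstract gives for using entropy instead of supervised feedback.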