Computer Science, 2022, Vol. 49, Issue (3): 92-98. doi: 10.11896/jsjkx.210200047

Special topic: Big Data & Data Science (virtual special topic)

• Database & Big Data & Data Science •

Data Stream Ensemble Classification Algorithm Based on Information Entropy Updating Weight

XIA Yuan1, ZHAO Yun-long1,2, FAN Qi-lin1

  1 School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  2 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China
  • Received: 2021-02-04 Revised: 2021-07-08 Online: 2022-03-15 Published: 2022-03-15
  • Corresponding author: ZHAO Yun-long (zhaoyunlong@nuaa.edu.cn)
  • About author: XIA Yuan (xiayuan@nuaa.edu.cn), born in 1995, postgraduate. His main research interests include data mining.
    ZHAO Yun-long, born in 1975, Ph.D., professor, is a member of China Computer Federation. His main research interests include pervasive computing, collective computing, wearable computing and swarm intelligence.

Abstract: In dynamic data streams, instability and concept drift require an ensemble classification model to adapt promptly to new environments. Base-classifier weights are usually updated with supervision information, so that base classifiers suited to the current environment receive higher weights. However, supervision information cannot be obtained immediately in a real data stream environment. To address this problem, this paper proposes a data stream ensemble classification algorithm that updates base-classifier weights using information entropy. First, each base classifier is initialized with a random feature subspace to construct the ensemble classifier. Second, a new base classifier is built from each newly arriving data block and replaces the base classifier with the lowest weight in the ensemble. Then, an information-entropy-based weight update strategy updates the weights of the base classifiers in real time. Finally, the base classifiers that meet the requirements participate in weighted voting to obtain the classification result. The proposed algorithm is compared with several classic learning algorithms; the experimental results show that it has a clear advantage in classification accuracy and is suitable for various types of concept drift environments.

Key words: Classification, Concept drift, Data stream, Ensemble algorithm, Information entropy
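
The abstract does not state the exact weight formula, so the following is only a minimal sketch of the label-free weighting and voting idea, not the authors' method. It assumes each base classifier outputs a class-probability vector for an unlabeled instance, and that a weight of one minus the normalized Shannon entropy is a reasonable stand-in. The names `entropy_weight` and `weighted_vote` and the cut-off `threshold` are hypothetical; the random feature subspaces and the replacement of the lowest-weight classifier are not modeled here.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weight(probs: Sequence[float]) -> float:
    """Hypothetical weight in [0, 1]: one minus the normalized entropy.

    A confident (low-entropy) prediction gets a weight near 1;
    a uniform (maximum-entropy) prediction gets a weight of 0.
    """
    n_classes = len(probs)
    if n_classes < 2:
        return 1.0
    return 1.0 - shannon_entropy(probs) / math.log(n_classes)


def weighted_vote(member_probs: List[Sequence[float]],
                  threshold: float = 0.1) -> int:
    """Combine per-classifier class probabilities without using labels.

    Members whose entropy-based weight is below `threshold` (an assumed
    cut-off) are skipped; the others add their probabilities scaled by
    their weight. Returns the index of the highest-scoring class
    (class 0 if every member is skipped).
    """
    n_classes = len(member_probs[0])
    scores = [0.0] * n_classes
    for probs in member_probs:
        w = entropy_weight(probs)
        if w < threshold:
            continue  # too uncertain to participate in the vote
        for k, p in enumerate(probs):
            scores[k] += w * p
    return max(range(n_classes), key=scores.__getitem__)


if __name__ == "__main__":
    # One confident member and two near-uniform members.
    outputs = [[0.05, 0.95], [0.55, 0.45], [0.60, 0.40]]
    print([round(entropy_weight(p), 3) for p in outputs])
    print(weighted_vote(outputs))
```

In this toy input, two of the three members lean slightly toward class 0, but both are close to uniform and fall below the assumed threshold, so the single confident member decides the vote. Because the weight depends only on the predicted distribution, it can be recomputed for every incoming instance before any true label arrives.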

CLC Number: TP391