Computer Science, 2025, Vol. 52, Issue (6): 139-150. doi: 10.11896/jsjkx.240300155

• Database & Big Data & Data Science •

Online Capricious Data Stream Learning with Sparse Labels

ZHANG Shuai, ZHOU Peng, ZHANG Yanping

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Received: 2024-03-25 Revised: 2024-09-11 Online: 2025-06-15 Published: 2025-06-11
  • Corresponding author: ZHOU Peng (doodzhou@ahu.edu.cn)
  • About author (zs0920ahu@163.com): ZHANG Shuai, born in 1996, postgraduate, is a student member of CCF (No. U3918G). His main research interests include data streams and online learning.
    ZHOU Peng, born in 1987, Ph.D., is a member of CCF (No. K6292M). His main research interests include data mining and machine learning.
  • Supported by:
    General Program of the National Natural Science Foundation of China (62376001) and General Program of the Natural Science Foundation of Anhui Province, China (2308085MF215).

Abstract: As data volumes grow rapidly, machine learning has gradually shifted from traditional static learning to online learning designed for streaming data. A capricious data stream is one in which data instances arrive one by one over time while the feature space may change arbitrarily: old features may vanish at any time, and new features may emerge at any time. For example, in environmental monitoring, adding new sensors or a sudden fault in existing sensors changes the feature space of the data stream arbitrarily. Furthermore, most existing online learning methods for data streams assume that the true labels of all data instances are available. In real-world applications, however, labels are usually sparse because manual annotation is costly. To address online learning from capricious data streams with sparse labels, this paper proposes PAACDS (Passive Aggressive Active Learning for Capricious Data Streams), an online learning algorithm based on passive-aggressive active learning, together with its variant PAACDS-I. First, an online active learning method selects valuable data instances, so that a strong prediction model can be built with minimal supervision. Then, once the labels of the selected instances have been queried, a dynamic classifier over the shared and newly added feature spaces of the capricious data stream is updated by combining the online passive-aggressive update rule with the margin-maximization principle. Finally, the proposed algorithms are compared with existing state-of-the-art methods on 12 datasets, and extensive experimental comparison and analysis validate their effectiveness on capricious data streams with sparse labels.

Key words: Online learning, Capricious data streams, Dynamic feature space, Active learning, Sparse labels
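
The two steps outlined in the abstract (deciding whether to query a label, then updating a classifier whose feature space can change) can be illustrated with a minimal Python sketch. This is not the authors' PAACDS or PAACDS-I: the abstract does not give their update equations, so the sketch substitutes a generic PA-I update and a margin-based randomized query rule, both assumed here for illustration. The class name `SparsePAActiveLearner` and the parameters `C`, `delta`, and `seed` are hypothetical.

```python
import random
from typing import Callable, Dict

Features = Dict[str, float]  # feature name -> value; keys may differ per instance


class SparsePAActiveLearner:
    """Illustrative sketch only (not the paper's algorithm): PA-I updates
    plus margin-based label querying over a changing feature space."""

    def __init__(self, C: float = 1.0, delta: float = 1.0, seed: int = 0):
        self.w: Dict[str, float] = {}  # weights for every feature seen so far
        self.C = C                     # assumed cap on the PA-I step size
        self.delta = delta             # assumed query-rate smoothing parameter
        self.rng = random.Random(seed)
        self.queries = 0               # number of labels requested so far

    def score(self, x: Features) -> float:
        # Only features present in both w and x contribute (the shared space);
        # vanished features are skipped and unseen features count as weight 0.
        return sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def should_query(self, x: Features) -> bool:
        # Query with probability delta / (delta + |margin|): instances the
        # model is unsure about (small margin) are queried more often.
        p = self.delta / (self.delta + abs(self.score(x)))
        return self.rng.random() < p

    def update(self, x: Features, y: int) -> None:
        # PA-I step: stay passive if the hinge loss is zero, otherwise move
        # just far enough (capped by C) to reduce the loss on this instance.
        loss = max(0.0, 1.0 - y * self.score(x))
        norm_sq = sum(v * v for v in x.values())
        if loss == 0.0 or norm_sq == 0.0:
            return
        tau = min(self.C, loss / norm_sq)
        for k, v in x.items():
            # New features enter w here, starting from an implicit 0 weight.
            self.w[k] = self.w.get(k, 0.0) + tau * y * v

    def step(self, x: Features, oracle: Callable[[], int]) -> int:
        # Predict first; pay for the true label only if the query rule fires.
        pred = 1 if self.score(x) >= 0 else -1
        if self.should_query(x):
            self.queries += 1
            self.update(x, oracle())
        return pred
```

In use, each arriving instance would be passed to `step` together with a callable that returns its true label in {-1, +1} on demand; the label cost is incurred only when the query rule fires, which is what keeps supervision sparse.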

CLC Number:

  • TP391