Computer Science ›› 2019, Vol. 46 ›› Issue (1): 64-72. doi: 10.11896/j.issn.1002-137X.2019.01.010

• The 7th China Conference on Data Mining (2018) •

概念漂移数据流分类中的多源在线迁移学习算法

秦一休1, 文益民1,2, 何倩1   

  1. (桂林电子科技大学计算机与信息安全学院 广西 桂林541004)1
    (广西可信软件重点实验室桂林电子科技大学 广西 桂林541004)2
  • 收稿日期:2018-06-02 出版日期:2019-01-15 发布日期:2019-02-25
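  • About the authors: QIN Yi-xiu (born 1992), male, master's degree; main research interests: machine learning and transfer learning. WEN Yi-min (born 1969), male, Ph.D., professor; main research interests: machine learning, data mining, and recommender systems; E-mail: ymwen2004@aliyun.com (corresponding author). HE Qian (born 1979), male, Ph.D., professor; main research interests: cloud computing, distributed computing, and information security.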
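  • Supported by:
    National Natural Science Foundation of China (61363029, 61866007), Natural Science Foundation of Guangxi (2018GXNSFDA138006), Guangxi Key Laboratory of Trustworthy Software project (KX201721), Guangxi Colleges and Universities Key Laboratory of Intelligent Processing of Computer Images and Graphics project (GIIP201505), and Guangxi Cooperative Innovation Center of Cloud Computing and Big Data project (YD16E12)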

Multi-source Online Transfer Learning Algorithm for Classification of Data Streams with Concept Drift

QIN Yi-xiu1, WEN Yi-min1,2, HE Qian1   

  1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  2 Guangxi Key Laboratory of Trustworthy Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  • Received: 2018-06-02  Online: 2019-01-15  Published: 2019-02-25

摘要: 现有概念漂移处理算法在检测到概念漂移发生后,通常需要在新到概念上重新训练分类器,同时“遗忘”以往训练的分类器。在概念漂移发生初期,由于能够获取到的属于新到概念的样本较少,导致新建的分类器在短时间内无法得到充分训练,分类性能通常较差。进一步,现有的基于在线迁移学习的数据流分类算法仅能使用单个分类器的知识辅助新到概念进行学习,在历史概念与新到概念相似性较差时,分类模型的分类准确率不理想。针对以上问题,文中提出一种能够利用多个历史分类器知识的数据流分类算法——CMOL。CMOL算法采取分类器权重动态调节机制,根据分类器的权重对分类器池进行更新,使得分类器池能够尽可能地包含更多的概念。实验表明,相较于其他相关算法,CMOL算法能够在概念漂移发生时更快地适应新到概念,显示出更高的分类准确率。

关键词: 多源迁移学习, 在线学习, 概念漂移, 数据流分类

Abstract: Existing algorithms for classifying data streams with concept drift usually train a new classifier on newly collected data once a concept drift is detected, and "forget" the previously trained classifiers. Because few samples of the new concept are available in the initial stage of a drift, the new classifier cannot be trained sufficiently in a short time, and its classification performance is usually poor. Furthermore, existing data stream classification algorithms based on online transfer learning can use the knowledge of only a single classifier to assist learning on the new concept, so classification accuracy is unsatisfactory when the historical concept is not similar to the new concept. To address these problems, this paper proposes CMOL, a multi-source online transfer learning algorithm for classifying data streams with concept drift that can utilize the knowledge of multiple historical classifiers. CMOL adopts a dynamic classifier weight adjustment mechanism and updates the classifier pool according to the weights of the classifiers in it, so that the pool covers as many concepts as possible. Experiments show that, compared with other related algorithms, CMOL adapts to a new concept faster when concept drift occurs and achieves higher classification accuracy.

Key words: Multi-source transfer learning, Online learning, Concept drift, Data stream classification

CLC Number: TP391
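
The abstract describes two mechanisms: classifier weights that are adjusted dynamically, and a classifier pool that is updated according to those weights. The abstract does not give the actual update rules, so the following is only a minimal sketch of the general idea and not the CMOL algorithm itself. It assumes binary labels in {-1, +1}, a multiplicative penalty for misclassification (a Hedge-style rule swapped in for illustration), and eviction of the lowest-weight classifier when the pool is full. The names `WeightedPool`, `max_size`, and `beta` are hypothetical.

```python
class WeightedPool:
    """Illustrative pool of historical classifiers combined by weighted vote.

    Assumptions (not taken from the paper): labels are -1 or +1, each
    classifier is a callable x -> label, and weights are updated with a
    multiplicative penalty `beta` on every misclassification.
    """

    def __init__(self, max_size=3, beta=0.5):
        self.max_size = max_size  # hypothetical cap on the pool size
        self.beta = beta          # hypothetical penalty factor in (0, 1)
        self.members = []         # list of [classifier, weight] pairs

    def _normalize(self):
        total = sum(w for _, w in self.members)
        for m in self.members:
            m[1] /= total

    def predict(self, x):
        # Weighted vote of all pool members; ties are resolved as +1.
        score = sum(w * clf(x) for clf, w in self.members)
        return 1 if score >= 0 else -1

    def update(self, x, y):
        # Shrink the weight of every member that misclassifies (x, y),
        # then renormalize so the weights sum to 1.
        for m in self.members:
            if m[0](x) != y:
                m[1] *= self.beta
        self._normalize()

    def add(self, clf):
        # When the pool is full, evict the member with the lowest weight.
        if len(self.members) >= self.max_size:
            worst = min(range(len(self.members)),
                        key=lambda i: self.members[i][1])
            del self.members[worst]
        # Assumption: a new member starts with the current maximum weight.
        start = max((w for _, w in self.members), default=1.0)
        self.members.append([clf, start])
        self._normalize()
```

In this sketch a classifier that keeps disagreeing with the incoming labels loses influence on the vote and is the first to be replaced when a new classifier is added, which is one simple way to keep classifiers for distinct, still-useful concepts in the pool.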