Computer Science ›› 2021, Vol. 48 ›› Issue (11): 184-191. doi: 10.11896/jsjkx.200900107

• Database & Big Data & Data Science •

Imbalanced Data Classification of AdaBoostv Algorithm Based on Optimum Margin

LU Shu-xia1,2, ZHANG Zhen-lian1   

  1. College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
    2. Hebei Province Key Laboratory of Machine Learning and Computational Intelligence, Baoding, Hebei 071002, China
  • Received: 2020-09-11 Revised: 2021-02-06 Online: 2021-11-15 Published: 2021-11-10
  • Corresponding author: LU Shu-xia (cmclusx@126.com)
  • About author: LU Shu-xia, born in 1966, Ph.D, professor, postgraduate supervisor, is a member of China Computer Federation. Her main research interests include machine learning.
  • Supported by:
    National Natural Science Foundation of China (61672205) and Key R&D Program of Science and Technology Foundation of Hebei Province (19210310D).

Abstract: To address the problem of imbalanced data classification, this paper proposes an AdaBoostv algorithm based on the optimum margin. The algorithm uses an improved SVM as the base classifier: a margin mean term is introduced into the SVM optimization model, and the margin mean term and the loss function term are weighted according to the data imbalance ratio. The optimization model is solved with the stochastic variance reduced gradient (SVRG) method to speed up convergence. The proposed algorithm also introduces a new adaptive cost-sensitive function into the sample weight update formula, assigning higher cost values to minority-class samples, misclassified minority-class samples, and minority-class samples near the decision boundary. In addition, by combining the new weight formula with an estimate of the optimum margin under a given precision parameter v, a new base-classifier weighting strategy is derived, which further improves the classification accuracy. Comparative experiments show that, in both the linear and nonlinear cases, the proposed algorithm achieves higher classification accuracy than other algorithms on imbalanced datasets and obtains a larger minimum margin.
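As a concrete illustration of the boosting scheme summarized in the abstract, the following Python sketch shows one way such an algorithm can be organized: an adaptive cost-sensitive factor enters the sample weight update, and the base-classifier weight is built from the current edge together with an estimate of the optimum margin under the precision parameter v (in the spirit of an AdaBoost*ν-type weighting rule). The helper fit_base_learner, the concrete cost values, and the use of min(edges) − v as the margin estimate are illustrative assumptions rather than the paper's exact formulas.

```python
# Minimal sketch only: the paper's exact adaptive cost-sensitive function and
# margin estimate are not reproduced here; helper names and concrete cost values
# are illustrative assumptions.
import numpy as np

def adaboostv_sketch(X, y, fit_base_learner, T=50, v=0.1):
    """y in {-1, +1}; the minority class is assumed to be labelled +1.
    fit_base_learner(X, y, D) must return a real-valued decision function f(X)."""
    n = len(y)
    D = np.ones(n) / n                        # sample weight distribution
    learners, alphas, edges = [], [], []

    for _ in range(T):
        f = fit_base_learner(X, y, D)         # e.g. the weighted SVM base classifier
        score = f(X)
        pred = np.sign(score)

        gamma = float(np.clip(np.sum(D * y * pred), -0.999, 0.999))  # edge on D
        edges.append(gamma)

        # estimate of the optimum margin under the precision parameter v:
        # the achievable minimum margin is upper-bounded by the smallest edge
        # observed so far, so min(edges) - v serves as the running estimate
        rho = float(np.clip(min(edges) - v, -0.999, 0.999))

        # base-classifier weight built from the edge and the margin estimate
        alpha = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
              - 0.5 * np.log((1 + rho) / (1 - rho))
        learners.append(f)
        alphas.append(alpha)

        # adaptive cost-sensitive factor (placeholder form): minority samples,
        # misclassified minority samples and minority samples close to the
        # decision boundary receive larger costs
        minority = (y == 1)
        cost = np.ones(n)
        cost[minority] += 0.5
        cost[minority & (pred != y)] += 0.5
        cost[minority & (np.abs(score) < 1.0)] += 0.5

        # cost-sensitive exponential weight update, renormalized to a distribution
        D = D * cost * np.exp(-alpha * y * pred)
        D = D / D.sum()

    def ensemble(Xq):
        votes = sum(a * np.sign(h(Xq)) for a, h in zip(alphas, learners))
        return np.sign(votes)
    return ensemble
```

In the paper's setting, the base learner would be the weighted SVM trained with SVRG; a matching sketch of that component follows the keyword list below.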

Key words: AdaBoostv, Adaptive cost-sensitive function, Imbalanced data, Optimum margin, SVRG
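The abstract also describes the base classifier: an SVM variant whose optimization model adds a margin mean term, with the margin mean and loss terms weighted by the data imbalance ratio, and which is solved by SVRG. The paper's exact model is not reproduced here; the sketch below assumes a simplified linear objective of that general shape (a class-weighted margin mean term plus a class-weighted hinge loss, bias omitted) and shows the standard SVRG pattern of a full-gradient snapshot per epoch followed by variance-reduced stochastic steps. The parameters c_pos, c_neg, lam, C, eta and the way the boosting weights D enter the objective are illustrative assumptions.

```python
# Minimal sketch under the assumptions stated above; not the paper's exact objective.
import numpy as np

def svrg_weighted_svm(X, y, D=None, c_pos=2.0, c_neg=1.0, lam=0.1, C=1.0,
                      eta=0.01, epochs=20, inner=None, seed=0):
    """Minimize (1/n) * sum_i f_i(w) with SVRG, where
        f_i(w) = 0.5*||w||^2 - lam*s_i*y_i*<w, x_i> + C*s_i*max(0, 1 - y_i*<w, x_i>)
    and s_i combines a class cost (c_pos for minority +1 samples, c_neg otherwise)
    with an optional boosting weight D."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    inner = inner or 2 * n
    if D is None:
        D = np.ones(n) / n
    s = n * D * np.where(y == 1, c_pos, c_neg)   # per-sample weight

    def grad_i(w, i):
        margin = y[i] * X[i].dot(w)
        g = w - lam * s[i] * y[i] * X[i]          # regularizer + margin mean term
        if margin < 1.0:                          # hinge subgradient
            g = g - C * s[i] * y[i] * X[i]
        return g

    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        # full gradient at the snapshot point
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(inner):
            i = int(rng.integers(n))
            # variance-reduced stochastic gradient
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w = w - eta * g
    return lambda Xq: Xq.dot(w)                   # real-valued decision function
```

Under these assumptions, svrg_weighted_svm can be passed directly as fit_base_learner to the boosting sketch above.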

CLC Number: TP181