Computer Science ›› 2021, Vol. 48 ›› Issue (11): 184-191. doi: 10.11896/jsjkx.200900107
LU Shu-xia (鲁淑霞)1,2, ZHANG Zhen-lian (张振莲)1
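Abstract: To address the classification of imbalanced data, an AdaBoostv algorithm based on the optimal margin is proposed. The algorithm uses an improved SVM as the base classifier: a margin mean term is introduced into the SVM optimization model, and the margin mean term and the loss function term are weighted according to the imbalance ratio of the data. The optimization model is solved with the Stochastic Variance Reduced Gradient (SVRG) method to accelerate convergence. The proposed algorithm introduces a new adaptive cost-sensitive function into the sample weight update formula, which assigns higher cost values to minority-class samples, misclassified minority-class samples, and minority-class samples near the decision boundary. In addition, by combining the new weight formula with an estimate of the optimal margin under a given precision parameter v, a new weighting strategy for the base classifiers is derived, which further improves classification accuracy. Comparative experiments show that, in both the linear and nonlinear cases, the proposed algorithm achieves higher classification accuracy on imbalanced datasets than the other algorithms and obtains a larger minimum margin.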
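To make the SVRG step mentioned in the abstract concrete, the following is a minimal sketch of SVRG applied to a linear, class-weighted SVM-style objective with a margin mean term. The abstract does not give the paper's optimization model, so everything specific here is an assumption for illustration: the objective `lam/2*||w||^2 + mean_i c_i*(max(0, 1 - m_i) - mu*m_i)` with `m_i = y_i*(w.x_i)`, the choice of class weights `c_i` (minority samples weighted by the imbalance ratio), the parameters `lam`, `mu`, `step`, `epochs`, and the toy data. It should not be read as the authors' formulation.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_grad(w, x, y, c, lam, mu):
    """(Sub)gradient of one sample's term in the ASSUMED objective:
    lam/2*||w||^2 + c*(max(0, 1 - m) - mu*m), with m = y*(w.x)."""
    margin = y * dot(w, x)
    hinge_active = 1.0 if margin < 1.0 else 0.0
    coef = -c * y * (mu + hinge_active)
    return [lam * wj + coef * xj for wj, xj in zip(w, x)]

def objective(w, X, Y, C, lam, mu):
    total = 0.0
    for x, y, c in zip(X, Y, C):
        m = y * dot(w, x)
        total += c * (max(0.0, 1.0 - m) - mu * m)
    return 0.5 * lam * dot(w, w) + total / len(X)

def svrg(X, Y, C, lam=0.1, mu=0.1, step=0.02, epochs=50, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        snapshot = list(w)
        # Full gradient at the snapshot, computed once per epoch.
        full = [0.0] * d
        for x, y, c in zip(X, Y, C):
            g = sample_grad(snapshot, x, y, c, lam, mu)
            full = [f + gj / n for f, gj in zip(full, g)]
        # Inner loop: variance-reduced stochastic steps.
        for _ in range(n):
            i = rng.randrange(n)
            g_w = sample_grad(w, X[i], Y[i], C[i], lam, mu)
            g_s = sample_grad(snapshot, X[i], Y[i], C[i], lam, mu)
            w = [wj - step * (a - b + f)
                 for wj, a, b, f in zip(w, g_w, g_s, full)]
    return w

# Toy imbalanced data (hypothetical): 8 majority (-1) and 2 minority (+1)
# points; the trailing 1.0 in each feature vector acts as a bias input.
neg = [(-1.0, -1.2), (-1.5, -0.8), (-0.9, -1.1), (-1.2, -1.4),
       (-0.7, -0.9), (-1.3, -0.6), (-1.1, -1.0), (-0.8, -1.3)]
pos = [(1.0, 0.9), (1.2, 1.1)]
X = [[a, b, 1.0] for a, b in neg + pos]
Y = [-1] * len(neg) + [1] * len(pos)
ratio = len(neg) / len(pos)                 # imbalance ratio
C = [ratio if y == 1 else 1.0 for y in Y]   # assumed weighting scheme

w = svrg(X, Y, C)
print("weights:", w)
print("objective:", objective(w, X, Y, C, 0.1, 0.1))
```

The point of the sketch is the inner update `g_w - g_s + full`: the per-sample gradient is corrected by its value at the snapshot plus the full snapshot gradient, which reduces the variance of the stochastic step. In a boosting setting, the per-sample weights `C` would be supplied by the booster at each round rather than fixed from the imbalance ratio as they are here.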