Computer Science ›› 2019, Vol. 46 ›› Issue (12): 8-12. doi: 10.11896/jsjkx.180901813

• Big Data & Data Science •

AdaBoostRS: Integration of High-dimensional Unbalanced Data Learning

YANG Ping-an, LIN Ya-ping, ZHU Tuan-fei

  1. (College of Information Science and Engineering, Hunan University, Changsha 410000, China)
  • Received: 2018-09-27 Online: 2019-12-15 Published: 2019-12-17
  • Corresponding author: LIN Ya-ping (born 1956), male, Ph.D., professor, doctoral supervisor. His main research interests include computer networks, cloud security, and machine learning. E-mail: yplin@hun.edu.cn.
  • About the authors: YANG Ping-an (born 1995), female, master's student. Her main research interests include machine learning and data mining. E-mail: ypingan@hnu.edu.cn. ZHU Tuan-fei (born 1987), male, Ph.D., CCF member. His main research interests include cloud security and machine learning.

Abstract: Class imbalance in machine learning refers to a skewed distribution of samples across classes, which biases learning toward the majority class. In high-dimensional data, sparsity makes this classification bias more pronounced. For high-dimensional imbalanced data, the curse of dimensionality and class imbalance therefore compound each other, making the problem harder to solve. To address this, this paper proposes an AdaBoost ensemble method that combines random subspaces with SMOTE oversampling, named AdaBoostRS (AdaBoost ensemble of Random subspace and SMOTE), for classifying high-dimensional imbalanced data. Specifically, AdaBoostRS trains each classifier on a subset of features selected by the random subspace method, which increases the diversity of the classification samples and reduces the dimensionality of the data. It then applies SMOTE to the reduced-dimension data, generating minority-class samples by linear interpolation to address the class imbalance. Experiments on 8 high-dimensional imbalanced standard time series datasets show that AdaBoostRS outperforms traditional ensemble learning methods on three performance measures: F-measure, G-mean, and AUC.
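The two data-preparation steps named in the abstract (random-subspace feature selection followed by SMOTE interpolation of the minority class) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the convention that label 1 marks the minority class, the default of 5 neighbours, and the choice to oversample until both classes are the same size are all assumptions made for illustration, and the AdaBoost training loop itself is omitted.

```python
import math
import random


def random_subspace(n_features, k, rng):
    """Pick k distinct feature indices uniformly at random (returned sorted)."""
    return sorted(rng.sample(range(n_features), k))


def project(rows, feature_idx):
    """Keep only the chosen feature columns of every row."""
    return [[row[j] for j in feature_idx] for row in rows]


def smote(minority, n_new, k_neighbors, rng):
    """Create n_new synthetic points, each a linear interpolation between a
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balanced_subspace_sample(X, y, k_features, k_neighbors=5, seed=0):
    """One round (assumed design): draw a feature subset, then oversample the
    minority class (label 1) inside that subspace until classes are equal."""
    rng = random.Random(seed)
    idx = random_subspace(len(X[0]), k_features, rng)
    Xs = project(X, idx)
    minority = [x for x, label in zip(Xs, y) if label == 1]
    n_majority = len(y) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k_neighbors, rng)
    return idx, Xs + new_points, list(y) + [1] * len(new_points)
```

In an ensemble, each base classifier would call `balanced_subspace_sample` with a different seed, so that every member sees a different feature subset and a different set of synthetic minority points. How the paper couples this with the AdaBoost sample weights is not stated in the abstract and is left out here.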

Key words: High-dimensional imbalance, Random subspace, SMOTE, AdaBoost

CLC Number: 

  • TP301.6
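The three evaluation measures named in the abstract can be computed as in the sketch below. The abstract does not give their formulas, so the definitions used here are the common ones and should be read as assumptions: F-measure is taken as the F1 score (equal weight on precision and recall), G-mean as the geometric mean of sensitivity and specificity, and AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the minority label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def f_measure(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall (assumed beta = 1)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, with true labels `[1, 1, 0, 0, 0, 0]`, predictions `[1, 0, 0, 0, 0, 1]`, and scores `[0.9, 0.4, 0.1, 0.2, 0.3, 0.8]`, these functions give an F-measure of 0.5, a G-mean of about 0.612, and an AUC of 0.875. Plain accuracy on the same predictions would be about 0.667, which shows why measures that account for the minority class are preferred on imbalanced data.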