Computer Science, 2019, Vol. 46, Issue (12): 8-12. doi: 10.11896/jsjkx.180901813
YANG Ping-an, LIN Ya-ping, ZHU Tuan-fei
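Abstract: In machine learning, the class imbalance problem refers to a skewed distribution of samples across classes, which biases the learning process toward the majority class. The sparsity of high-dimensional data makes this bias more pronounced, so for high-dimensional imbalanced data the curse of dimensionality and class imbalance compound each other, making the problem harder to solve. To address this, the paper proposes an AdaBoost ensemble method that combines random subspaces with the SMOTE oversampling technique (AdaBoost ensemble of Random subspace and SMOTE, AdaBoostRS) for classifying high-dimensional imbalanced data. Specifically, AdaBoostRS trains each classifier on a subset of features selected by the random subspace method, which increases diversity among the training samples and reduces the dimensionality of the data; it then applies SMOTE to linearly interpolate the minority class in the reduced-dimension data to address the class imbalance. Experiments on 8 high-dimensional imbalanced standard time series datasets show that, measured by F-measure, G-mean, and AUC, AdaBoostRS outperforms traditional ensemble learning methods.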
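The two preprocessing steps named in the abstract (random-subspace feature selection, then SMOTE-style linear interpolation of the minority class) can be illustrated with a minimal sketch. This is not the authors' implementation: the boosting loop and base classifier are omitted, and the function names (`random_subspace`, `project`, `smote`), the toy data, the subspace size, and the neighbour count `k` are all illustrative assumptions.

```python
import random

def random_subspace(n_features, k, rng):
    """Pick k distinct feature indices uniformly at random (returned sorted)."""
    return sorted(rng.sample(range(n_features), k))

def project(rows, feature_idx):
    """Keep only the selected feature columns of every row."""
    return [[row[j] for j in feature_idx] for row in rows]

def smote(minority, n_new, k, rng):
    """SMOTE-style oversampling (assumes at least 2 minority samples).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy data (hypothetical): 6 majority and 3 minority samples, 5 features each.
rng = random.Random(0)
majority = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(6)]
minority = [[rng.gauss(3, 1) for _ in range(5)] for _ in range(3)]

# Step 1: reduce dimensionality by keeping a random subset of 3 features.
idx = random_subspace(5, 3, rng)
maj_sub = project(majority, idx)
min_sub = project(minority, idx)

# Step 2: oversample the minority class in the reduced space until balanced.
new_points = smote(min_sub, len(maj_sub) - len(min_sub), k=2, rng=rng)
balanced_minority = min_sub + new_points
```

In an ensemble setting, each member would presumably draw its own feature subset and its own synthetic samples before training, which is where the added diversity comes from; how exactly these steps interleave with the AdaBoost weight updates is specified in the paper itself, not in this sketch.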