Computer Science ›› 2019, Vol. 46 ›› Issue (5): 203-208. doi: 10.11896/j.issn.1002-137X.2019.05.031

• Artificial Intelligence •

Imbalanced Data Classification Algorithm Based on Probability Sampling and Ensemble Learning

CAO Ya-xi, HUANG Hai-yan

  1. (Key Laboratory of Advanced Process Control and Optimization for Chemical Processes (East China University of Science and Technology), Ministry of Education, Shanghai 200237, China)
  • Published: 2019-05-15
  • About the authors: CAO Ya-xi (born 1993), female, master's student; main research interests: machine learning and data mining; E-mail: yaxi_cao@163.com. HUANG Hai-yan (born 1972), female, Ph.D., associate professor; main research interests: control and optimization, and modeling of complex industrial processes; E-mail: huanghong@ecust.edu.cn (corresponding author).

Abstract: Owing to its strong generalization ability, ensemble learning is widely applied in class-imbalanced settings such as information retrieval, image processing, and biology. To improve classification performance on imbalanced data, this paper proposes an ensemble learning algorithm based on sampling balance and feature selection, namely Oversampling Based on Probability Distribution-Embedding Feature Selection in Boosting (OBPD-EFSBoost). The algorithm consists of three main steps. First, the minority class is oversampled according to a probability model estimated from its Gaussian mixture distribution, which yields a balanced dataset and enlarges the potential decision region of the minority class. Second, when each base classifier is trained, both sample weights and feature weights are updated according to the samples misclassified in the previous round, so that redundant and noisy features are filtered out. Finally, the ensemble classifier is obtained by weighted voting over the base classifiers. Classification results on 8 UCI datasets show that the algorithm not only improves classification accuracy on the minority class but also mitigates the sensitivity of Boosting-type algorithms to noisy features, and that it is robust.
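The first step of the algorithm (oversampling the minority class from a Gaussian mixture model) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the number of mixture components, the covariance structure, or the fitting procedure, so the sketch assumes a diagonal-covariance mixture fitted with plain EM. The function names `fit_diag_gmm` and `oversample_minority`, the parameters `k`, `iters` and `var_floor`, and the toy data are all hypothetical.

```python
import math
import random


def fit_diag_gmm(points, k, iters=30, seed=0, var_floor=1e-6):
    """Fit a k-component, diagonal-covariance Gaussian mixture with plain EM.

    Assumption: the diagonal covariance and the fixed iteration count are
    simplifications for illustration, not details taken from the paper.
    """
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    means = [list(p) for p in rng.sample(points, k)]  # start from k data points
    centre = [sum(p[j] for p in points) / n for j in range(d)]
    spread = [max(sum((p[j] - centre[j]) ** 2 for p in points) / n, var_floor)
              for j in range(d)]
    variances = [list(spread) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point (log-space).
        resp = []
        for p in points:
            logs = []
            for c in range(k):
                ll = math.log(weights[c])
                for j in range(d):
                    v = variances[c][j]
                    ll -= 0.5 * (math.log(2 * math.pi * v)
                                 + (p[j] - means[c][j]) ** 2 / v)
                logs.append(ll)
            top = max(logs)
            exps = [math.exp(l - top) for l in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means and variances.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc < 1e-12:
                continue  # keep the previous parameters of an empty component
            weights[c] = nc / n
            for j in range(d):
                mu = sum(r[c] * p[j] for r, p in zip(resp, points)) / nc
                var = sum(r[c] * (p[j] - mu) ** 2
                          for r, p in zip(resp, points)) / nc
                means[c][j] = mu
                variances[c][j] = max(var, var_floor)
    return weights, means, variances


def oversample_minority(minority, target_size, k=2, seed=0):
    """Draw synthetic minority samples until the class reaches target_size."""
    weights, means, variances = fit_diag_gmm(minority, k, seed=seed)
    rng = random.Random(seed + 1)
    synthetic = []
    for _ in range(max(target_size - len(minority), 0)):
        c = rng.choices(range(k), weights=weights)[0]  # pick a mixture component
        synthetic.append([rng.gauss(m, math.sqrt(v))
                          for m, v in zip(means[c], variances[c])])
    return synthetic


# Toy data (hypothetical): 60 majority points, 12 minority points in two clusters.
data_rng = random.Random(42)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(60)]
minority = ([[data_rng.gauss(3, 0.3), data_rng.gauss(3, 0.3)] for _ in range(6)]
            + [[data_rng.gauss(-3, 0.3), data_rng.gauss(3, 0.3)] for _ in range(6)])
synthetic = oversample_minority(minority, target_size=len(majority))
balanced_minority = minority + synthetic
print(len(majority), len(balanced_minority))
```

In a full pipeline the balanced set would then be passed to the boosting stage. The rule for combining sample and feature weights is not given in the abstract, so it is not sketched here.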

Key words: Imbalanced data classification, Ensemble learning, Feature selection, Probability distribution

CLC number:

  • TP391