Computer Science (计算机科学), 2019, Vol. 46, Issue 5: 203-208. doi: 10.11896/j.issn.1002-137X.2019.05.031

• Artificial Intelligence •

Imbalanced Data Classification Algorithm Based on Probability Sampling and Ensemble Learning

CAO Ya-xi, HUANG Hai-yan

  1. (Key Laboratory of Advanced Process Control and Optimization for Chemical Processes (East China University of Science and Technology), Ministry of Education, Shanghai 200237, China)
  • Published: 2019-05-15
  • About the authors: CAO Ya-xi (born 1993), female, master's candidate; her main research interests include machine learning and data mining, E-mail: yaxi_cao@163.com. HUANG Hai-yan (born 1972), female, Ph.D., associate professor, is the corresponding author; her main research interests include control and optimization, and modeling of complex industrial processes, E-mail: huanghong@ecust.edu.cn.

Abstract: Owing to its strong generalization ability, ensemble learning is widely applied in class-imbalanced scenarios such as information retrieval, image processing, and biology. To improve classification performance on imbalanced data, this paper proposed an ensemble learning algorithm based on sampling balance and feature selection, namely Oversampling Based on Probability Distribution-Embedding Feature Selection in Boosting (OBPD-EFSBoost). The algorithm consists of three steps. First, the minority class is oversampled according to a probability model estimated from a Gaussian mixture distribution of the minority class, which yields a balanced dataset and enlarges the potential decision region of the minority class. Second, when a base classifier is trained in each round, sample weights and feature weights are adjusted jointly according to the samples misclassified in the previous round, so that redundant and noisy features are filtered out. Finally, the ensemble classifier is obtained by weighted voting over the base classifiers. Classification results on eight UCI datasets show that the algorithm not only improves the classification accuracy for the minority class but also remedies the sensitivity of Boosting-style algorithms to noisy features, demonstrating strong robustness.
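
The paper itself does not include source code. As an illustration of the first step, the probability-distribution-based oversampling, the following is a minimal sketch assuming a Gaussian mixture model is fitted to the minority class and sampled until the classes are balanced; the function name oversample_minority and all parameter choices are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of step 1 (not the authors' code): fit a Gaussian mixture
# model to the minority class and draw synthetic samples from it until the two
# classes are balanced. Names and parameters are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def oversample_minority(X, y, minority_label=1, n_components=3, random_state=0):
    """Return a class-balanced dataset via GMM-based oversampling of the minority class."""
    X_min = X[y == minority_label]
    X_maj = X[y != minority_label]
    n_new = len(X_maj) - len(X_min)               # synthetic samples needed for balance
    if n_new <= 0:
        return X, y                               # already balanced (or minority is larger)
    gmm = GaussianMixture(n_components=n_components, random_state=random_state)
    gmm.fit(X_min)                                # estimate the minority-class distribution
    X_syn, _ = gmm.sample(n_new)                  # draw synthetic minority samples
    X_bal = np.vstack([X, X_syn])
    y_bal = np.concatenate([y, np.full(n_new, minority_label)])
    return X_bal, y_bal
```

Balancing the training set in this way enlarges the region of feature space attributed to the minority class before any base classifier is trained, which is the stated purpose of the oversampling step.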

Key words: Ensemble learning, Feature selection, Imbalanced data classification, Probability distribution
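
The second and third steps described in the abstract, per-round sample re-weighting with feature filtering followed by a weighted vote, could be sketched along AdaBoost-like lines as below. This is only an approximation of the described procedure: the feature-scoring rule (mutual information), the decision-stump base learner, and all function names are assumptions rather than the exact OBPD-EFSBoost formulation.

```python
# Illustrative sketch of steps 2-3 (not the authors' exact OBPD-EFSBoost): each
# boosting round re-weights the samples misclassified in the previous round and
# keeps only the top-k features, then the base classifiers are combined by
# weighted voting. Feature scoring and all names here are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import mutual_info_classif

def boost_with_feature_selection(X, y, n_rounds=10, k_features=10):
    n, d = X.shape
    w = np.full(n, 1.0 / n)                        # uniform initial sample weights
    learners, alphas, feats = [], [], []
    for _ in range(n_rounds):
        scores = mutual_info_classif(X, y)         # feature relevance (unweighted; a simplification)
        keep = np.argsort(scores)[::-1][:min(k_features, d)]
        clf = DecisionTreeClassifier(max_depth=1)
        clf.fit(X[:, keep], y, sample_weight=w)
        pred = clf.predict(X[:, keep])
        err = max(np.sum(w * (pred != y)) / np.sum(w), 1e-10)
        if err >= 0.5:                             # stop if no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)      # weight of this base classifier
        w *= np.exp(alpha * (pred != y))           # up-weight misclassified samples
        w /= w.sum()
        learners.append(clf)
        alphas.append(alpha)
        feats.append(keep)
    return learners, alphas, feats

def weighted_vote(learners, alphas, feats, X, classes=(0, 1)):
    """Combine base classifiers by weighted voting over their predictions."""
    votes = np.zeros((len(X), len(classes)))
    for clf, alpha, keep in zip(learners, alphas, feats):
        pred = clf.predict(X[:, keep])
        for ci, c in enumerate(classes):
            votes[:, ci] += alpha * (pred == c)
    return np.asarray(classes)[votes.argmax(axis=1)]
```

In the procedure described by the abstract, these two sketches would be chained: the balanced dataset produced by the oversampling step is what the boosting loop is trained on, and the returned (learner, weight, feature-subset) triples define the final ensemble classifier.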

CLC Number: TP391