Computer Science ›› 2019, Vol. 46 ›› Issue (5): 203-208. doi: 10.11896/j.issn.1002-137X.2019.05.031

Imbalanced Data Classification Algorithm Based on Probability Sampling and Ensemble Learning

CAO Ya-xi, HUANG Hai-yan   

  1. (Key Laboratory of Advanced Process Control and Optimization for Chemical Processes (East China University of Science and Technology), Ministry of Education, Shanghai 200237, China)
  • Published: 2019-05-15

Abstract: Ensemble learning has attracted wide attention in class-imbalanced applications such as information retrieval, image processing, and biology because of its generalization ability. To improve the performance of classification algorithms on imbalanced data, this paper proposes an ensemble learning algorithm, Oversampling Based on Probability Distribution-Embedding Feature Selection in Boosting (OBPD-EFSBoost). The algorithm consists of three main steps. First, the original data are oversampled based on probability distribution estimation to construct a balanced dataset. Second, when training the base classifier in each round, OBPD-EFSBoost increases the weights of misclassified samples and accounts for the effect of noise features on the classification results, thereby filtering out redundant noise features. Finally, the ensemble classifier is obtained by weighted voting over the base classifiers. Experimental results show that the algorithm not only improves classification accuracy on the minority class but also eliminates the sensitivity of Boosting to noise features, and that it is strongly robust.
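
The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' OBPD-EFSBoost. It assumes independent per-feature Gaussians as the probability distribution estimate, uses standard AdaBoost with decision stumps as the boosting step, and treats the single feature each stump selects as a crude stand-in for embedded feature selection. All function names and the toy data are hypothetical.

```python
import math
import random


def oversample_gaussian(X, y, minority=1, seed=0):
    """Step 1 (assumed form): fit an independent Gaussian to each minority
    feature and draw synthetic minority rows until the classes are balanced."""
    rng = random.Random(seed)
    rows = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(rows)) - len(rows)
    means, stds = [], []
    for j in range(len(X[0])):
        col = [r[j] for r in rows]
        mu = sum(col) / len(col)
        means.append(mu)
        stds.append(math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)))
    X_new, y_new = list(X), list(y)
    for _ in range(max(0, n_needed)):
        X_new.append([rng.gauss(m, s) for m, s in zip(means, stds)])
        y_new.append(minority)
    return X_new, y_new


def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] >= t else -polarity


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) with lowest error.
    Each stump uses one feature, so uninformative features are never chosen."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                cand = (0.0, j, t, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(cand, row) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best


def adaboost(X, y, rounds=5):
    """Step 2: each round raises the weights of misclassified samples."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Step 3: weighted vote over the base classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5.0], [0.2, 3.0], [0.3, 4.0], [0.4, 1.0], [0.5, 2.0], [0.6, 6.0],
     [2.0, 3.5], [2.2, 1.5]]
y = [-1, -1, -1, -1, -1, -1, 1, 1]  # +1 is the minority class

X_bal, y_bal = oversample_gaussian(X, y, minority=1)
ensemble = adaboost(X_bal, y_bal)
selected_features = {stump[1] for _, stump in ensemble}
```

On this toy data the stumps only ever split on the informative feature, which hints at how selecting features inside each boosting round can keep a noise feature out of the ensemble. The paper's actual selection criterion may differ.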

Key words: Ensemble learning, Feature selection, Imbalanced data classification, Probability distribution

CLC Number: TP391