Computer Science ›› 2015, Vol. 42 ›› Issue (9): 249-252. doi: 10.11896/j.issn.1002-137X.2015.09.048

• Artificial Intelligence •

Imbalanced Data Classification Method Based on the RSBoost Algorithm

李克文,杨磊,刘文英,刘璐,刘洪太   

  1. College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao 266580
  • Publication date: 2018-11-14  Release date: 2018-11-14
  • Funding:
    This work was supported by the Natural Science Foundation of Shandong Province (ZR2013FL034).

Classification Method of Imbalanced Data Based on RSBoost

LI Ke-wen, YANG Lei, LIU Wen-ying, LIU Lu and LIU Hong-tai   

  • Online: 2018-11-14  Published: 2018-11-14

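Abstract: The classification of imbalanced data arises in many application domains and has become a research hotspot in data mining and machine learning. This paper proposes RSBoost, a new classification method for imbalanced data, to address the low minority-class recognition rate and low classification efficiency of traditional classification methods. The method first oversamples the minority class with SMOTE, then applies random undersampling to the whole data set to reduce its imbalance, and finally combines the resampled data with the Boosting algorithm to classify it. Experiments compare the classification performance and efficiency of five methods on several public data sets; the results show that the proposed method achieves a high recognition rate and high classification efficiency.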

Keywords: imbalanced data, combined data sampling, Boosting, RSBoost

Abstract: Class imbalance is common in many application domains and has become a research hotspot in data mining and machine learning. We present a new classification method for imbalanced data, called RSBoost, to increase the recognition rate of the minority class and the classification efficiency. The approach uses SMOTE (synthetic minority over-sampling technique) and random under-sampling to balance the data sets, and then uses a boosting method to optimize classification performance. We conducted experiments on several public data sets to evaluate the performance of RSBoost and four other methods. The experimental results show that the proposed approach improves classification performance and efficiency on imbalanced data sets.

Key words: Imbalanced data, Mixed data sampling, Boosting, RSBoost
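
The abstract describes the RSBoost pipeline only at a high level: SMOTE oversampling of the minority class, random undersampling of the whole data set, then boosting. The following is a minimal sketch of that pipeline in plain Python, not the authors' implementation. Several points are assumptions: resampling is done once before boosting (the abstract does not say whether it is repeated per round), the boosting step is AdaBoost with decision stumps, the function names (`smote`, `resample`, `adaboost`, `predict`) and parameters (`keep_fraction`, `k`, `rounds`) are hypothetical, and the two-cluster toy data is synthetic.

```python
import math
import random

def smote(minority, n_new, k, rng):
    """Make n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def resample(X, y, keep_fraction, k, rng):
    """Oversample the minority class (label +1) with SMOTE up to the majority
    size, then randomly undersample the whole set without replacement."""
    minority = [x for x, label in zip(X, y) if label == 1]
    n_majority = len(X) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k, rng)
    data = list(zip(X, y)) + [(p, 1) for p in new_points]
    kept = rng.sample(data, int(keep_fraction * len(data)))
    return [x for x, _ in kept], [label for _, label in kept]

def fit_stump(X, y, w):
    """Weighted decision stump: the (error, feature, threshold, sign) tuple
    with the lowest weighted error over all data-value thresholds."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (sign if x[f] >= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best

def stump_predict(stump, x):
    _, f, t, sign = stump
    return sign if x[f] >= t else -sign

def adaboost(X, y, rounds):
    """Standard AdaBoost on labels in {-1, +1}; returns (alpha, stump) pairs."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Synthetic toy data: 60 majority points (label -1), 6 minority points (label +1).
rng = random.Random(0)
X = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
     + [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(6)])
y = [-1] * 60 + [1] * 6

X_bal, y_bal = resample(X, y, keep_fraction=0.5, k=3, rng=rng)
model = adaboost(X_bal, y_bal, rounds=10)
print(predict(model, (6.0, 6.0)), predict(model, (0.0, 0.0)))
```

In this sketch the SMOTE step is what changes the class ratio, while undersampling the whole set mainly shrinks the training data, which is one plausible reading of the efficiency claim in the abstract. A per-round variant in the style of SMOTEBoost or RUSBoost would instead call `resample` inside the boosting loop.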

