Computer Science (计算机科学) ›› 2018, Vol. 45 ›› Issue (6A): 22-27, 57.

• Survey •

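Overview of Imbalanced Data Classification

ZHAO Nan, ZHANG Xiao-fang, ZHANG Li-jun

  1. School of Computer Science, Northwestern Polytechnical University, Xi'an 710000, China
  • Online: 2018-06-20 Published: 2018-08-03
  • About the authors: ZHAO Nan (born 1991), female, master's student; main research interest: data mining. ZHANG Xiao-fang (born 1971), female, Ph.D., associate professor, CCF member; main research interests: software engineering and database technology. ZHANG Li-jun (born 1978), male, Ph.D., lecturer, CCF member; main research interests: data mining and distributed databases; E-mail: zhanglijun@nwpu.edu.cn (corresponding author).
  • Funding:
    Supported by the Fundamental Research Funds for the Central Universities (3102015JSJ0004), the National High Technology Research and Development Program of China (863 Program) (2015AA015307), and the National Natural Science Foundation of China (61402370).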

Abstract: In many application domains the class distribution of data is imbalanced, and classifying such data correctly has drawn significant attention in data mining and machine learning over the last decade. Because standard classifiers assume a relatively balanced class distribution and equal misclassification costs, most of them perform poorly on imbalanced data. Different methods have therefore been proposed for each phase of the classification process. This paper categorizes and analyzes the relevant research of recent years, systematically introducing approaches from four perspectives: feature selection, adjustment of the data distribution, classification algorithms, and classifier evaluation. Finally, it discusses future trends and open research issues in imbalanced data classification.

Key words: Imbalanced data classification, Feature selection for imbalanced data, Imbalanced classification assessment, Adjustment of data distribution, Classification algorithm for imbalanced data

CLC Number:

  • TP311