Overview of Imbalanced Data Classification

Abstract

Abstract: Imbalanced data classification has drawn significant attention from the research community over the last decade. Because most standard classifiers assume a relatively balanced class distribution and equal misclassification costs, they perform poorly on imbalanced data. Different methods have been proposed for the various phases of data classification. This paper analyzes the relevant research achievements of recent years and introduces approaches to imbalanced data from four perspectives: feature selection, adjustment of the data distribution, classification algorithms, and classifier evaluation. Finally, it discusses future trends and the research issues that remain open in imbalanced data classification.

Key words: Adjustment of data distribution, Classification algorithm for imbalanced data, Feature selection for imbalanced data, Imbalanced classification assessment, Imbalanced data classification

CLC Number: TP311

ZHAO Nan, ZHANG Xiao-fang, ZHANG Li-jun. Overview of Imbalanced Data Classification[J]. Computer Science, 2018, 45(6A): 22-27.
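
Two of the perspectives named in the abstract, adjustment of the data distribution and classifier evaluation, can be illustrated with a minimal sketch. It is not taken from the paper: the function names `random_oversample` and `minority_metrics`, the 95:5 toy data, and the choice of plain random oversampling (rather than any specific method the survey covers) are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every class
    has as many samples as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def minority_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed toy data: 95 majority-class (0) and 5 minority-class (1) samples.
labels = [0] * 95 + [1] * 5
samples = list(range(100))

# A classifier that always predicts the majority class scores high accuracy
# while detecting none of the minority samples.
always_majority = [0] * 100
baseline = minority_metrics(labels, always_majority)

# Rebalance the training distribution by duplicating minority samples.
balanced_samples, balanced_labels = random_oversample(samples, labels)
balanced_counts = Counter(balanced_labels)
```

In this sketch the always-majority baseline is right on 95 of 100 samples yet has zero recall on the minority class, which is why accuracy alone is a poor measure on imbalanced data. Random oversampling is only the simplest way to adjust the class distribution, and because it duplicates samples it can encourage overfitting.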

