Computer Science (计算机科学), 2018, Vol. 45, Issue 6A: 22-27.
ZHAO Nan (赵楠), ZHANG Xiao-fang (张小芳), ZHANG Li-jun (张利军)
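Abstract: In many application domains the class distribution of the data is imbalanced, and classifying such data correctly is an active research topic in data mining and machine learning. Classical classification algorithms do not account for class imbalance and assume that misclassification costs are the same for every class, so they perform poorly on imbalanced data. Different methods for handling imbalanced data have been proposed for each step of the classification process. This paper categorizes and analyzes the related research of recent years, systematically introduces the methods from the perspectives of feature selection, data distribution adjustment, classification algorithms, and evaluation of classification results, and discusses directions for further exploration.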
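As an illustration of why equal misclassification costs give poor results on imbalanced data, the following minimal sketch compares accuracy with minority-class recall for a classifier that always predicts the majority class. The data set (10 positive and 990 negative samples) and the helper names `confusion_counts`, `accuracy`, and `recall` are assumptions made for this example and do not come from the paper.

```python
# Hypothetical example: a trivial majority-class predictor on imbalanced labels.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` treated as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Fraction of minority (positive) samples that are found."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Assumed toy data: 10 minority samples (label 1), 990 majority samples (label 0).
y_true = [1] * 10 + [0] * 990
# A classifier that ignores the minority class entirely.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, minority recall={rec:.2f}")
```

Under these assumed numbers the predictor is right on 990 of 1000 samples yet finds none of the minority samples, which is why a single overall accuracy figure can hide a complete failure on the minority class.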