Computer Science ›› 2019, Vol. 46 ›› Issue (1): 94-99. doi: 10.11896/j.issn.1002-137X.2019.01.014

• The 7th China Conference on Data Mining, 2018 •

样本自适应的不平衡分类器

才子昕, 王馨月, 徐剑, 景丽萍   

  1. (北京交通大学交通数据分析与挖掘北京市重点实验室 北京100044)
  • 收稿日期:2018-04-23 出版日期:2019-01-15 发布日期:2019-02-25
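  • About the authors: CAI Zi-xin (born 1995), female, master's student; her main research interests are machine learning and imbalanced data classification. WANG Xin-yue (born 1994), female, Ph.D. student; her main research interest is imbalanced data analysis. XU Jian (born 1994), male, master's student; his main research interests are machine learning and imbalanced data classification. JING Li-ping (born 1978), female, Ph.D., professor, CCF member; her main research interests are high-dimensional data subspace research and machine learning. E-mail: lpjing@bjtu.edu.cn (corresponding author).
  • Supported by:
    National Natural Science Foundation of China (61370129, 61375062, 61632004, 61773050)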

Sample Adaptive Classifier for Imbalanced Data

CAI Zi-xin, WANG Xin-yue, XU Jian, JING Li-ping   

  1. (Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China)
  • Received: 2018-04-23 Online: 2019-01-15 Published: 2019-02-25

摘要: 大数据时代,不平衡数据分类在实际应用场景中频繁出现。以二分类为例,传统分类器由于较难学习少数类数据集内部的本质结构,容易将少数类样本错误分类。针对这一问题,一种有效的解决方法是在传统的方法中引入代价敏感机制,为少数类样本赋予更高的误分代价以提升其预测精度。这类方法同等对待了同类样本集中的数据,然而同一类内的不同样本可能对训练过程有不同程度的贡献。为了提升代价敏感机制的有效性,样本自适应的代价敏感策略为不同的样本赋予不同的权重。首先,通过考察样本局部的类分布情况,判断其距离两类样本边界的远近;然后,根据边界分布理论,即距离决策面越近的样本对决策面位置的影响越大,为距离两类样本边界越近的样本赋予越高的权重。实验过程中,通过将样本自适应代价敏感策略应用于LDM,并在标准数据集上进行一系列对比实验,验证了样本自适应代价敏感策略在处理不平衡数据分类问题上的有效性。

关键词: 边界样本, 代价敏感学习, 分类

Abstract: In the era of big data, imbalanced data classification arises frequently in real-world applications. Taking binary classification as an example, traditional learning algorithms cannot sufficiently learn the intrinsic structure of the minority class and tend to be biased towards the majority class, so minority instances are easily misclassified. An effective remedy is cost-sensitive learning, which assigns a higher misclassification cost to the minority class to improve its prediction accuracy. However, such methods treat all instances within a class equally, although different instances may contribute differently to the learning process. To make cost-sensitive learning more effective, this paper proposes a sample-adaptive cost-sensitive strategy for imbalanced data classification, which assigns a different misclassification weight to each instance. First, the strategy estimates how close each instance is to the boundary between the two classes by examining the local class distribution around it. Then, following margin theory, under which instances nearer to the decision surface have a greater influence on its position, it assigns higher weights to instances nearer to the boundary. The proposed strategy is applied to the classical large margin distribution machine (LDM), and a series of comparative experiments on UCI datasets shows that the strategy effectively improves classifier performance on imbalanced data classification.
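
The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything concrete here is an assumption made for illustration: the hypothetical function name `boundary_weights`, the use of the k nearest neighbours as the "local class distribution", the `1 + opposite` weighting, and the inverse-class-frequency cost. It is not the authors' method, only one plausible way to assign larger weights to instances that appear closer to the class boundary.

```python
import math


def boundary_weights(X, y, k=3):
    """Return one weight per sample (illustrative assumption, not the paper's formula).

    Samples whose k nearest neighbours contain more opposite-class points
    are treated as closer to the class boundary and get larger weights.
    """
    n = len(X)
    counts = {label: y.count(label) for label in set(y)}
    weights = []
    for i in range(n):
        # Distances from sample i to every other sample, nearest first.
        dists = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        neighbours = [j for _, j in dists[:k]]
        # Fraction of the nearest neighbours that belong to the other class.
        opposite = sum(1 for j in neighbours if y[j] != y[i]) / len(neighbours)
        # Assumed class-level cost: inverse class frequency (rarer class costs more).
        class_cost = n / (len(counts) * counts[y[i]])
        weights.append(class_cost * (1.0 + opposite))
    return weights


# Toy data: four majority points (label 0) and two minority points (label 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (1.5, 1.5), (5, 5)]
y = [0, 0, 0, 0, 1, 1]
weights = boundary_weights(X, y, k=3)
```

In this toy example the minority point at (1.5, 1.5), which is surrounded by majority points, receives the largest weight, while the majority point at (0, 0), which lies deep inside its own class, receives the smallest. Such per-sample weights could then be passed to any classifier that accepts instance weights in its loss.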

Key words: Boundary sample, Classification, Cost-sensitive learning

CLC Number: 

  • TP391