计算机科学 ›› 2020, Vol. 47 ›› Issue (1): 102-109.doi: 10.11896/jsjkx.190600060

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于海林格距离和SMOTE的多类不平衡学习算法

董明刚1,2,姜振龙1,敬超1,2   

  1. (桂林理工大学信息科学与工程学院 广西 桂林541004)1;
    (广西嵌入式技术与智能系统重点实验室 广西 桂林541004)2
  • 收稿日期:2019-06-12 发布日期:2020-01-19
  • 通讯作者: 敬超(jingchao@glut.edu.cn)
  • 基金资助:
    国家自然科学基金(61563012,61802085);广西自然科学基金(2014GXNSFAA118371,2015GXNSFBA139260);广西嵌入式技术与智能系统重点实验室基金(2018A-04)

Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE Algorithm

DONG Ming-gang1,2,JIANG Zhen-long1,JING Chao1,2   

  1. (College of Information Science and Engineering,Guilin University of Technology,Guilin,Guangxi 541004,China)1;
    (Guangxi Key Laboratory of Embedded Technology and Intelligent System,Guilin,Guangxi 541004,China)2
  • Received:2019-06-12 Published:2020-01-19
  • About author:DONG Ming-gang,born in 1977,Ph.D,professor,is senior member of China Computer Federation(CCF).His main research interests include intelligent computing,multi-objective optimization and machine learning;JING Chao,born in 1983,Ph.D,asso-ciate professor.His main research inte-rests include cloud computing and big data processing,workflow scheduling on cloud data center and deep reinforcement learning.
  • Supported by:
    This work supported by the National Natural Science Foundation of China (61563012,61802085),Natural Science Foundation of Guangxi, China (2014GXNSFAA118371,2015GXNSFBA139260) and Guangxi Key Laboratory of Embedded Technology and Intelligent System Foundation (2018A-04).

摘要: 数据不平衡现象在现实生活中普遍存在。在处理不平衡数据时,传统的机器学习算法难以达到令人满意的效果。少数类样本合成上采样技术(Synthetic Minority Oversampling Technique,SMOTE)是一种有效的方法,但在多类不平衡数据中,边界点分布错乱和类别分布不连续变得更加复杂,导致合成的样本点会侵入其他类别区域,造成数据过泛化。鉴于基于海林格距离的决策树已被证明对不平衡数据具有不敏感性,文中结合海林格距离和SMOTE,提出了一种基于海林格距离和SMOTE的上采样算法(Based on Hellinger Distance and SMOTE Oversampling Algorithm,HDSMOTE)。首先,建立基于海林格距离的采样方向选择策略,通过比较少数类样本点的局部近邻域内的海林格距离的大小,来引导合成样本点的方向。其次,设计了基于海林格距离的采样质量评估策略,以免合成的样本点侵入其他类别的区域,降低过泛化的风险。最后,采用7种代表性的上采样算法和HDSMOTE算法对15个多类不平衡数据集进行预处理,使用决策树的分类器进行分类,以Precision,Recall,F-measure,G-mean和MAUC作为评价标准对各算法的性能进行评价。实验结果表明,相比于对比算法,HDSMOTE算法在以上评价标准上均有所提升:在Precision上最高提升了17.07%,在Recall上最高提升了21.74%,在F-measure上最高提升了19.63%,在G-mean上最高提升了16.37%,在MAUC上最高提升了8.51%。HDSMOTE相对于7种代表性的上采样方法,在处理多类不平衡数据时有更好的分类效果。

关键词: SMOTE, 多类不平衡学习, 分类, 海林格距离, 上采样

Abstract: Imbalanced data is common in real life.Traditional machine learning algorithms are difficult to achieve satisfied results on imbalanced data.The synthetic minority oversampling technique (SMOTE) is an efficient method to handle this problem.However,in multi-class imbalanced data,disordered distribution of boundary sample and discontinuous class distribution become more complicated,and the synthetic samples may invade other classes area,leading to over-generalization.In order to solve this issue,considering the algorithm based on Hellinger distance decision tree has been proved to be insensitive to imbalanced data,combining with Hellinger distance and SMOTE,this paper proposed an oversampling method SMOTE with Hellinger distance (HDSMOTE).Firstly,a sampling direction selection strategy was presented based on Hellinger distances of local neighborhood area,which can guide the direction of the synthesized sample.Secondly,a sampling quality evaluation strategy based on Hellinger distance was designed to avoid the synthesized sample into other classes,which can reduce the risk of over-generalization.Finally,to demonstrate the performance of HDSMOTE,15 multi-class imbalanced data sets were preprocessed by 7 representative oversampling algorithms and HDSMOTE algorithm,and were classified with C4.5 decision tree.Precision,Recall,F-measure,G-mean and MAUC are employed as the evaluation standards.Compared with competitive oversampling methods,the experimental results show that the HDSMOTE algorithm has improved in the these evaluation standards.It is increased by 17.07% in Precision,21.74% in Recall,19.63% in F-measure,16.37% in G-mean,and 8.51% in MAUC.HDSMOTE has better classification performance than the seven representative oversampling methods on multi-class imbalanced data.

Key words: Classification, Hellinger distance, Multi-class imbalanced learning, Oversampling, SMOTE

中图分类号: 

  • TP311
[1]HE H,GARCIA E A.Learning from Imbalanced Data [J]. IEEE Transactions on Knowledge & Data Engineering,2009,21(9):1263-1284.
[2]KRAWCZYK,BARTOSZ.Learning from imbalanced data:open challenges and future directions [J].Progress in Artificial Intelligence,2016,5(4):221-232.
[3]LI Y X,CHAI Y,HU Y Q,et al.Review of imbalanced data classification methods[J].Control and Decision,2019,34(4):673-688.
[4]ZHAO N,ZHANG X F,ZHANG L J.Overview of Imbalanced Data Classification[J].Computer Science,2018,45(S1):22-27,57.
[5]LI Y,LIU Z D,ZHANG H J.Review on ensemble algorithms for imbalanced data classification[J].Application Research of Computers,2014,31(5):1287-1291.
[6]GUO H X,LI Y J,JENNIFER S,et al.Learning from class-imbalanced data:Review of methods and applications [J].Expert Systems with Applications,2017,73:220-239.
[7]MIAO Z M,ZHAO L W,TIAN S W,et al.Class Imbalance Learning for Identifying NLOS in UWB Positioning[J].Journal of Signal Processing,2016,32(1):8-13.
[8]XIA P P,ZHANG L.Application of Imbalanced Data Learning Algorithms to Similarity Learning[J].Pattern Recognition and Artificial Intelligence | Patt Recog Artif Intell,2014,27(12):1138-1145.
[9]WEI W W,LI J J,GAO L B.Effective detection of sophisticated online banking fraud on extremely;imbalanced data[J].World Wide Web-internet & Web Information Systems,2013,16(4):449-475.
[10]CHAWLA N V,BOWYER K W,HALL L O,et al.SMOTE:Synthetic Minority Over-sampling Technique[J].Journal of Artificial Intelligence Research,2002,16(1):321-357.
[11]HE H,BAI Y,GARCIA E A,et al.ADASYN:Adaptive Synthetic Sampling Approach for Imbalanced Learning[C]∥IEEE International Joint Conference on Neural Networks,2008(IJCNN 2008).IEEE,2008:1322-1328.
[12]BARUA S,ISLAM M M,YAO X,et al.MWMOTE-Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning[J].IEEE Transactions on Knowledge and Data Engineering,2014,26(2):405-425.
[13]NEKOOEIMEHR I,LAI-YUEN S K.Adaptive semi-unsuper- vised weighted oversampling (A-SUWO) for imbalanced datasets[J].Expert Systems with Applications,2016,46:405-416.
[14]PUNTUMAPON K,RAKTHAMAMON T,WAIYAMAI K. Clusterbased minority over-sampling for imbalanced datasets[J].IEICE TRANSACTIONS on Information and Systems,2016,99(12):3101-3109.
[15]HAN H,WANG W Y,MAO B H.Borderline-SMOTE:A New Over-Sampling Method in Imbalanced Data Sets Learning[M]∥Advances in Intelligent Computing.Springer Berlin Heidelberg,2005:878-887.
[16]ANAND R,MEHROTRA K,MOHAN C K,et al.Efficient classification for multiclass problems using modular neural networks[J].IEEE Transactions on Neural Networks,1995,6(1):117-124.
[17]ZHU T,LIN Y,LIU Y.Synthetic minority oversampling technique for multiclass imbalance problems [J].Pattern Recognition,2017,72:327-340.
[18]ABDI L,HASHEMI S.To combat multi-class imbalanced problems by means of over-sampling techniques[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(1):238-251.
[19]YANG X,KUANG Q,ZHANG W,et al.AMDO:An Over-Sampling Technique for Multi-Class Imbalanced Problems[J].IEEE Transactions on Knowledge & Data Engineering,2018,30(9):1672-1685.
[20]CIESLAK D A,CHAWLA N V.Learning Decision Trees for Unbalanced Data[C]∥Joint European Conference on Machine Learning and Knowledge Discovery in Databases.Berlin:Sprin-ger,2008:241-256.
[21]CIESLAK D A,HOENS T R,CHAWLA N V,et al.Hellinger distance decision trees are robust and skew-insensitive[J].Data Mining and Knowledge Discovery,2012,24(1):136-158.
[22]UCI.Machine Learning Repository[OL].http://mlr.cs.uma-ss.edu/ml/datasets.html.
[23]KEEL Dataset[OL].https://sci2s.ugr.es/keel/category.php?cat=clas&order=name#sub2.
[24]HAND D J,TILL R J.A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems[J].Machine Learning,2001,45(2):171-186.
[25]COHEN W W.Fast Effective Rule Induction[C]∥Twelfth International Conference on International Conference on Machine Learning.Elsevier,1995:115-123.
[1] 陈志强, 韩萌, 李慕航, 武红鑫, 张喜龙.
数据流概念漂移处理方法研究综述
Survey of Concept Drift Handling Methods in Data Streams
计算机科学, 2022, 49(9): 14-32. https://doi.org/10.11896/jsjkx.210700112
[2] 周旭, 钱胜胜, 李章明, 方全, 徐常胜.
基于对偶变分多模态注意力网络的不完备社会事件分类方法
Dual Variational Multi-modal Attention Network for Incomplete Social Event Classification
计算机科学, 2022, 49(9): 132-138. https://doi.org/10.11896/jsjkx.220600022
[3] 郝志荣, 陈龙, 黄嘉成.
面向文本分类的类别区分式通用对抗攻击方法
Class Discriminative Universal Adversarial Attack for Text Classification
计算机科学, 2022, 49(8): 323-329. https://doi.org/10.11896/jsjkx.220200077
[4] 檀莹莹, 王俊丽, 张超波.
基于图卷积神经网络的文本分类方法研究综述
Review of Text Classification Methods Based on Graph Convolutional Network
计算机科学, 2022, 49(8): 205-216. https://doi.org/10.11896/jsjkx.210800064
[5] 闫佳丹, 贾彩燕.
基于双图神经网络信息融合的文本分类方法
Text Classification Method Based on Information Fusion of Dual-graph Neural Network
计算机科学, 2022, 49(8): 230-236. https://doi.org/10.11896/jsjkx.210600042
[6] 武红鑫, 韩萌, 陈志强, 张喜龙, 李慕航.
监督和半监督学习下的多标签分类综述
Survey of Multi-label Classification Based on Supervised and Semi-supervised Learning
计算机科学, 2022, 49(8): 12-25. https://doi.org/10.11896/jsjkx.210700111
[7] 高振卓, 王志海, 刘海洋.
嵌入典型时间序列特征的随机Shapelet森林算法
Random Shapelet Forest Algorithm Embedded with Canonical Time Series Features
计算机科学, 2022, 49(7): 40-49. https://doi.org/10.11896/jsjkx.210700226
[8] 杨炳新, 郭艳蓉, 郝世杰, 洪日昌.
基于数据增广和模型集成策略的图神经网络在抑郁症识别上的应用
Application of Graph Neural Network Based on Data Augmentation and Model Ensemble in Depression Recognition
计算机科学, 2022, 49(7): 57-63. https://doi.org/10.11896/jsjkx.210800070
[9] 张洪博, 董力嘉, 潘玉彪, 萧宗志, 张惠臻, 杜吉祥.
视频理解中的动作质量评估方法综述
Survey on Action Quality Assessment Methods in Video Understanding
计算机科学, 2022, 49(7): 79-88. https://doi.org/10.11896/jsjkx.210600028
[10] 邵欣欣.
TI-FastText自动商品分类算法
TI-FastText Automatic Goods Classification Algorithm
计算机科学, 2022, 49(6A): 206-210. https://doi.org/10.11896/jsjkx.210500089
[11] 陈景年.
一种适于多分类问题的支持向量机加速方法
Acceleration of SVM for Multi-class Classification
计算机科学, 2022, 49(6A): 297-300. https://doi.org/10.11896/jsjkx.210400149
[12] 杨健楠, 张帆.
一种结合双注意力机制和层次网络结构的细碎农作物分类方法
Classification Method for Small Crops Combining Dual Attention Mechanisms and Hierarchical Network Structure
计算机科学, 2022, 49(6A): 353-357. https://doi.org/10.11896/jsjkx.210200169
[13] 杨涵, 万游, 蔡洁萱, 方铭宇, 吴卓超, 金扬, 钱伟行.
基于步态分类辅助的虚拟IMU的行人导航方法
Pedestrian Navigation Method Based on Virtual Inertial Measurement Unit Assisted by GaitClassification
计算机科学, 2022, 49(6A): 759-763. https://doi.org/10.11896/jsjkx.211200148
[14] 黄璞, 沈阳阳, 杜旭然, 杨章静.
基于局部约束特征线表示的人脸识别
Face Recognition Based on Locality Constrained Feature Line Representation
计算机科学, 2022, 49(6A): 429-433. https://doi.org/10.11896/jsjkx.210300169
[15] 庞兴龙, 朱国胜.
基于半监督学习的网络流量分析研究
Survey of Network Traffic Analysis Based on Semi Supervised Learning
计算机科学, 2022, 49(6A): 544-554. https://doi.org/10.11896/jsjkx.210600131
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!