Computer Science, 2020, Vol. 47, Issue (1): 102-109. doi: 10.11896/jsjkx.190600060

• Database & Big Data & Data Science •

Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE Algorithm

DONG Ming-gang1,2, JIANG Zhen-long1, JING Chao1,2

  1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
  2. Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin, Guangxi 541004, China
  • Received:2019-06-12 Published:2020-01-19
  • About author: DONG Ming-gang, born in 1977, Ph.D, professor, is a senior member of China Computer Federation (CCF). His main research interests include intelligent computing, multi-objective optimization and machine learning; JING Chao, born in 1983, Ph.D, associate professor. His main research interests include cloud computing and big data processing, workflow scheduling on cloud data centers and deep reinforcement learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61563012, 61802085), the Natural Science Foundation of Guangxi, China (2014GXNSFAA118371, 2015GXNSFBA139260) and the Guangxi Key Laboratory of Embedded Technology and Intelligent System Foundation (2018A-04).

Abstract: Imbalanced data is common in real-world applications, and traditional machine learning algorithms struggle to achieve satisfactory results on it. The synthetic minority oversampling technique (SMOTE) is an effective way to handle this problem. In multi-class imbalanced data, however, the disordered distribution of boundary samples and discontinuous class distributions become more complicated, and synthetic samples may intrude into the regions of other classes, leading to over-generalization. Because Hellinger distance decision trees have been shown to be insensitive to imbalanced data, this paper combines Hellinger distance with SMOTE and proposes an oversampling method, SMOTE with Hellinger distance (HDSMOTE). First, a sampling direction selection strategy based on the Hellinger distances of local neighborhoods is presented to guide the direction in which samples are synthesized. Second, a sampling quality evaluation strategy based on Hellinger distance is designed to keep synthesized samples from falling into other classes, which reduces the risk of over-generalization. Finally, to evaluate HDSMOTE, 15 multi-class imbalanced data sets were preprocessed by 7 representative oversampling algorithms and by HDSMOTE, and then classified with a C4.5 decision tree, using Precision, Recall, F-measure, G-mean and MAUC as the evaluation metrics. Compared with the competing oversampling methods, HDSMOTE improved on these metrics: by 17.07% in Precision, 21.74% in Recall, 19.63% in F-measure, 16.37% in G-mean and 8.51% in MAUC. HDSMOTE thus achieves better classification performance than the seven representative oversampling methods on multi-class imbalanced data.
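
The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the HDSMOTE algorithm itself, whose direction selection and quality evaluation strategies are not specified in this abstract. It only shows, under stated assumptions, the normalized Hellinger distance between two discrete distributions and the standard SMOTE interpolation step. The function names `hellinger_distance` and `smote_sample` are hypothetical, chosen for illustration.

```python
import math
import random
from typing import List, Sequence


def hellinger_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete probability distributions.

    Assumption: the normalized form with values in [0, 1] is used here;
    the paper may use a different variant.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def smote_sample(x: Sequence[float], neighbor: Sequence[float],
                 rng: random.Random) -> List[float]:
    """Standard SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


if __name__ == "__main__":
    # Class distributions of two hypothetical local neighborhoods.
    print(hellinger_distance([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
    # One synthetic sample between a minority point and one of its neighbors.
    print(smote_sample([0.0, 0.0], [2.0, 4.0], random.Random(0)))
```

In a method of this kind, a distance such as `hellinger_distance` could be used to compare the class distributions around candidate neighbors before `smote_sample` is applied; how HDSMOTE actually does this is described in the full text, not here.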

Key words: Classification, Hellinger distance, Multi-class imbalanced learning, Oversampling, SMOTE

CLC Number: TP311