Computer Science ›› 2020, Vol. 47 ›› Issue (1): 102-109. doi: 10.11896/jsjkx.190600060

• Database & Big Data & Data Science •

Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE Algorithm

DONG Ming-gang1,2,JIANG Zhen-long1,JING Chao1,2   

  1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China;
    2. Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin, Guangxi 541004, China
  • Received:2019-06-12 Published:2020-01-19
  • About author: DONG Ming-gang, born in 1977, Ph.D, professor, is a senior member of China Computer Federation (CCF). His main research interests include intelligent computing, multi-objective optimization and machine learning. JING Chao, born in 1983, Ph.D, associate professor. His main research interests include cloud computing and big data processing, workflow scheduling on cloud data centers, and deep reinforcement learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61563012, 61802085), the Natural Science Foundation of Guangxi, China (2014GXNSFAA118371, 2015GXNSFBA139260) and the Guangxi Key Laboratory of Embedded Technology and Intelligent System Foundation (2018A-04).

Abstract: Imbalanced data is common in real life, and traditional machine learning algorithms have difficulty achieving satisfactory results on it. The synthetic minority oversampling technique (SMOTE) is an effective way to handle this problem. In multi-class imbalanced data, however, the disordered distribution of boundary samples and the discontinuous class distribution become more complicated, and the synthesized samples may invade the regions of other classes, leading to over-generalization. To address this issue, and considering that decision trees based on the Hellinger distance have been shown to be insensitive to class imbalance, this paper combined the Hellinger distance with SMOTE and proposed an oversampling method called HDSMOTE (SMOTE with Hellinger distance). Firstly, a sampling-direction selection strategy based on the Hellinger distances in the local neighborhood was presented to guide the direction in which samples are synthesized. Secondly, a sampling-quality evaluation strategy based on the Hellinger distance was designed to prevent synthesized samples from falling into the regions of other classes, reducing the risk of over-generalization. Finally, to demonstrate the performance of HDSMOTE, 15 multi-class imbalanced data sets were preprocessed with 7 representative oversampling algorithms and with HDSMOTE, and then classified with a C4.5 decision tree, using Precision, Recall, F-measure, G-mean and MAUC as evaluation measures. The experimental results show that, compared with the competing oversampling methods, HDSMOTE improves Precision by 17.07%, Recall by 21.74%, F-measure by 19.63%, G-mean by 16.37% and MAUC by 8.51%, and thus achieves better classification performance than the seven representative oversampling methods on multi-class imbalanced data.
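For reference, the Hellinger distance underlying both strategies is the standard distance between two discrete distributions, the same quantity that Hellinger distance decision trees [20-21] use as a skew-insensitive split criterion. With the usual normalization it is

d_H(P,Q) \;=\; \frac{1}{\sqrt{2}}\,\sqrt{\sum_{i=1}^{k}\bigl(\sqrt{p_i}-\sqrt{q_i}\bigr)^{2}}, \qquad 0 \le d_H(P,Q) \le 1,

where P=(p_1,\dots,p_k) and Q=(q_1,\dots,q_k) are two discrete distributions, for example the class proportions observed in a local neighborhood and a reference distribution; some decision-tree formulations omit the 1/\sqrt{2} factor, in which case the upper bound is \sqrt{2}. In the decision-tree setting the distance is taken between class-conditional feature distributions, so it does not involve the class priors, which is why it is regarded as insensitive to imbalance. The abstract does not state the exact per-neighborhood form used by HDSMOTE, so this formula is given only as background.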

Key words: SMOTE, Oversampling, Hellinger distance, Multi-class imbalanced learning, Classification
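To make the two strategies described in the abstract concrete, the following is a minimal, illustrative Python sketch of Hellinger-guided SMOTE-style oversampling: candidates are interpolated between minority-class neighbors as in standard SMOTE, and a candidate is kept only if the class distribution of its local neighborhood stays close, in Hellinger distance, to a pure-minority distribution. The function names, neighborhood size k and acceptance threshold are assumptions made for illustration; this is not the paper's actual HDSMOTE direction-selection or quality-evaluation procedure.

# Illustrative sketch only: SMOTE-style interpolation with a Hellinger-based
# neighborhood check. Names, k and hd_threshold are illustrative assumptions,
# not the exact HDSMOTE rules.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hellinger(p, q):
    # Hellinger distance between two discrete distributions, normalized to [0, 1].
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def local_class_distribution(nn_all, y, candidate, classes):
    # Class proportions among the k nearest neighbors of a candidate point.
    idx = nn_all.kneighbors(candidate.reshape(1, -1), return_distance=False)[0]
    counts = np.array([(y[idx] == c).sum() for c in classes], dtype=float)
    return counts / counts.sum()

def hd_guided_smote(X, y, minority_class, n_new, k=5, hd_threshold=0.5, seed=0):
    # Generate synthetic minority samples; keep a candidate only if its local
    # class distribution stays close (in Hellinger distance) to the all-minority
    # distribution, one simple way to avoid invading other classes' regions.
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    target = (classes == minority_class).astype(float)   # ideal neighborhood: pure minority
    X_min = X[y == minority_class]
    nn_min = NearestNeighbors(n_neighbors=min(k, len(X_min) - 1) + 1).fit(X_min)
    nn_all = NearestNeighbors(n_neighbors=k).fit(X)
    synthetic, tries = [], 0
    while len(synthetic) < n_new and tries < 50 * n_new:  # guard against rejection loops
        tries += 1
        i = rng.integers(len(X_min))
        neigh = nn_min.kneighbors(X_min[i].reshape(1, -1), return_distance=False)[0][1:]
        j = rng.choice(neigh)
        candidate = X_min[i] + rng.random() * (X_min[j] - X_min[i])  # SMOTE interpolation
        dist = local_class_distribution(nn_all, y, candidate, classes)
        if hellinger(dist, target) <= hd_threshold:       # Hellinger-based quality check
            synthetic.append(candidate)
    return np.array(synthetic)

In this sketch, lowering hd_threshold makes the quality check stricter, which reflects the intuition behind reducing over-generalization: candidates whose neighborhoods are already dominated by other classes are rejected rather than added to the training set.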

CLC Number: TP311
[1]HE H,GARCIA E A.Learning from Imbalanced Data [J]. IEEE Transactions on Knowledge & Data Engineering,2009,21(9):1263-1284.
[2]KRAWCZYK B.Learning from imbalanced data:open challenges and future directions [J].Progress in Artificial Intelligence,2016,5(4):221-232.
[3]LI Y X,CHAI Y,HU Y Q,et al.Review of imbalanced data classification methods[J].Control and Decision,2019,34(4):673-688.
[4]ZHAO N,ZHANG X F,ZHANG L J.Overview of Imbalanced Data Classification[J].Computer Science,2018,45(S1):22-27,57.
[5]LI Y,LIU Z D,ZHANG H J.Review on ensemble algorithms for imbalanced data classification[J].Application Research of Computers,2014,31(5):1287-1291.
[6]GUO H X,LI Y J,JENNIFER S,et al.Learning from class-imbalanced data:Review of methods and applications [J].Expert Systems with Applications,2017,73:220-239.
[7]MIAO Z M,ZHAO L W,TIAN S W,et al.Class Imbalance Learning for Identifying NLOS in UWB Positioning[J].Journal of Signal Processing,2016,32(1):8-13.
[8]XIA P P,ZHANG L.Application of Imbalanced Data Learning Algorithms to Similarity Learning[J].Pattern Recognition and Artificial Intelligence,2014,27(12):1138-1145.
[9]WEI W W,LI J J,GAO L B.Effective detection of sophisticated online banking fraud on extremely imbalanced data[J].World Wide Web-Internet and Web Information Systems,2013,16(4):449-475.
[10]CHAWLA N V,BOWYER K W,HALL L O,et al.SMOTE:Synthetic Minority Over-sampling Technique[J].Journal of Artificial Intelligence Research,2002,16(1):321-357.
[11]HE H,BAI Y,GARCIA E A,et al.ADASYN:Adaptive Synthetic Sampling Approach for Imbalanced Learning[C]∥IEEE International Joint Conference on Neural Networks,2008(IJCNN 2008).IEEE,2008:1322-1328.
[12]BARUA S,ISLAM M M,YAO X,et al.MWMOTE-Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning[J].IEEE Transactions on Knowledge and Data Engineering,2014,26(2):405-425.
[13]NEKOOEIMEHR I,LAI-YUEN S K.Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets[J].Expert Systems with Applications,2016,46:405-416.
[14]PUNTUMAPON K,RAKTHAMAMON T,WAIYAMAI K.Cluster-based minority over-sampling for imbalanced datasets[J].IEICE Transactions on Information and Systems,2016,99(12):3101-3109.
[15]HAN H,WANG W Y,MAO B H.Borderline-SMOTE:A New Over-Sampling Method in Imbalanced Data Sets Learning[M]∥Advances in Intelligent Computing.Springer Berlin Heidelberg,2005:878-887.
[16]ANAND R,MEHROTRA K,MOHAN C K,et al.Efficient classification for multiclass problems using modular neural networks[J].IEEE Transactions on Neural Networks,1995,6(1):117-124.
[17]ZHU T,LIN Y,LIU Y.Synthetic minority oversampling technique for multiclass imbalance problems [J].Pattern Recognition,2017,72:327-340.
[18]ABDI L,HASHEMI S.To combat multi-class imbalanced problems by means of over-sampling techniques[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(1):238-251.
[19]YANG X,KUANG Q,ZHANG W,et al.AMDO:An Over-Sampling Technique for Multi-Class Imbalanced Problems[J].IEEE Transactions on Knowledge & Data Engineering,2018,30(9):1672-1685.
[20]CIESLAK D A,CHAWLA N V.Learning Decision Trees for Unbalanced Data[C]∥Joint European Conference on Machine Learning and Knowledge Discovery in Databases.Berlin:Springer,2008:241-256.
[21]CIESLAK D A,HOENS T R,CHAWLA N V,et al.Hellinger distance decision trees are robust and skew-insensitive[J].Data Mining and Knowledge Discovery,2012,24(1):136-158.
[22]UCI.Machine Learning Repository[OL].http://mlr.cs.umass.edu/ml/datasets.html.
[23]KEEL Dataset[OL].https://sci2s.ugr.es/keel/category.php?cat=clas&order=name#sub2.
[24]HAND D J,TILL R J.A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems[J].Machine Learning,2001,45(2):171-186.
[25]COHEN W W.Fast Effective Rule Induction[C]∥Twelfth International Conference on International Conference on Machine Learning.Elsevier,1995:115-123.