%A DONG Ming-gang, JIANG Zhen-long, JING Chao
%T Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE Algorithm
%0 Journal Article
%D 2020
%J Computer Science
%R 10.11896/jsjkx.190600060
%P 102-109
%V 47
%N 1
%U {https://www.jsjkx.com/CN/abstract/article_18828.shtml}
%8 2020-01-15
%X Imbalanced data is common in real life. Traditional machine learning algorithms struggle to achieve satisfactory results on imbalanced data. The synthetic minority oversampling technique (SMOTE) is an efficient method for handling this problem. However, in multi-class imbalanced data, the disordered distribution of boundary samples and the discontinuous class distribution become more complicated, and synthetic samples may intrude into the regions of other classes, leading to over-generalization. To address this issue, and given that decision trees based on Hellinger distance have been shown to be insensitive to imbalanced data, this paper combines Hellinger distance with SMOTE and proposes an oversampling method, SMOTE with Hellinger distance (HDSMOTE). First, a sampling direction selection strategy based on the Hellinger distances of local neighborhoods is presented to guide the direction in which samples are synthesized. Second, a sampling quality evaluation strategy based on Hellinger distance is designed to keep synthesized samples from falling into other classes, which reduces the risk of over-generalization. Finally, to demonstrate the performance of HDSMOTE, 15 multi-class imbalanced data sets were preprocessed with 7 representative oversampling algorithms and with HDSMOTE, and then classified with a C4.5 decision tree. Precision, Recall, F-measure, G-mean and MAUC were used as the evaluation metrics. Compared with the competing oversampling methods, the experimental results show that HDSMOTE improves on these metrics: by 17.07% in Precision, 21.74% in Recall, 19.63% in F-measure, 16.37% in G-mean, and 8.51% in MAUC. HDSMOTE achieves better classification performance than the seven representative oversampling methods on multi-class imbalanced data.