Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE

doi:10.11896/jsjkx.190600060

Abstract

Imbalanced data is common in real life, and traditional machine learning algorithms struggle to achieve satisfactory results on it. The synthetic minority oversampling technique (SMOTE) is an effective way to handle this problem. However, in multi-class imbalanced data, the disordered distribution of boundary samples and discontinuous class distributions become more complicated, and synthetic samples may invade the regions of other classes, leading to over-generalization. Because decision trees based on Hellinger distance have been shown to be insensitive to imbalanced data, this paper combines Hellinger distance with SMOTE and proposes an oversampling method, SMOTE with Hellinger distance (HDSMOTE). First, a sampling direction selection strategy based on the Hellinger distances of local neighborhoods is presented to guide the direction in which samples are synthesized. Second, a sampling quality evaluation strategy based on Hellinger distance is designed to keep synthesized samples from falling into other classes, which reduces the risk of over-generalization. Finally, to demonstrate the performance of HDSMOTE, 15 multi-class imbalanced data sets were preprocessed by 7 representative oversampling algorithms and by HDSMOTE, and then classified with a C4.5 decision tree, using Precision, Recall, F-measure, G-mean and MAUC as evaluation measures. The experimental results show that, compared with the competing oversampling methods, HDSMOTE improves on these measures: by 17.07% in Precision, 21.74% in Recall, 19.63% in F-measure, 16.37% in G-mean and 8.51% in MAUC. HDSMOTE therefore achieves better classification performance than the seven representative oversampling methods on multi-class imbalanced data.
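The two building blocks named in the abstract can be illustrated with a minimal Python sketch. It shows only the standard Hellinger distance between two discrete distributions and the basic SMOTE interpolation step; it is not the HDSMOTE procedure itself, whose direction selection and quality evaluation rules are not specified in the abstract. The function names `hellinger`, `class_distribution` and `smote_sample` are hypothetical and chosen here for illustration.

```python
import math
import random


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def class_distribution(labels, classes):
    """Relative frequency of each class in a (local) set of labels."""
    n = len(labels)
    return [sum(1 for y in labels if y == c) / n for c in classes]


def smote_sample(x, neighbor, rng):
    """Basic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


# Illustrative use (assumption: comparing the class mix of two neighborhoods).
classes = ["a", "b", "c"]
near_x = class_distribution(["a", "a", "a", "b"], classes)
near_nb = class_distribution(["a", "b", "c", "c"], classes)
print(hellinger(near_x, near_nb))

rng = random.Random(0)
print(smote_sample([0.0, 0.0], [2.0, 4.0], rng))
```

A distance of 0 means the two neighborhoods have identical class proportions, and 1 means they share no class. How HDSMOTE turns such distances into a sampling direction or an acceptance test is described in the full paper rather than here.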

Key words: Classification, Hellinger distance, Multi-class imbalanced learning, Oversampling, SMOTE

CLC Number: TP311

DONG Ming-gang, JIANG Zhen-long, JING Chao. Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE Algorithm [J]. Computer Science, 2020, 47(1): 102-109.

