计算机科学 ›› 2020, Vol. 47 ›› Issue (1): 102-109.doi: 10.11896/jsjkx.190600060
董明刚1,2,姜振龙1,敬超1,2
DONG Ming-gang1,2,JIANG Zhen-long1,JING Chao1,2
摘要: 数据不平衡现象在现实生活中普遍存在。在处理不平衡数据时,传统的机器学习算法难以达到令人满意的效果。少数类样本合成上采样技术(Synthetic Minority Oversampling Technique,SMOTE)是一种有效的方法,但在多类不平衡数据中,边界点分布错乱和类别分布不连续变得更加复杂,导致合成的样本点会侵入其他类别区域,造成数据过泛化。鉴于基于海林格距离的决策树已被证明对不平衡数据具有不敏感性,文中结合海林格距离和SMOTE,提出了一种基于海林格距离和SMOTE的上采样算法(Based on Hellinger Distance and SMOTE Oversampling Algorithm,HDSMOTE)。首先,建立基于海林格距离的采样方向选择策略,通过比较少数类样本点的局部近邻域内的海林格距离的大小,来引导合成样本点的方向。其次,设计了基于海林格距离的采样质量评估策略,以免合成的样本点侵入其他类别的区域,降低过泛化的风险。最后,采用7种代表性的上采样算法和HDSMOTE算法对15个多类不平衡数据集进行预处理,使用决策树的分类器进行分类,以Precision,Recall,F-measure,G-mean和MAUC作为评价标准对各算法的性能进行评价。实验结果表明,相比于对比算法,HDSMOTE算法在以上评价标准上均有所提升:在Precision上最高提升了17.07%,在Recall上最高提升了21.74%,在F-measure上最高提升了19.63%,在G-mean上最高提升了16.37%,在MAUC上最高提升了8.51%。HDSMOTE相对于7种代表性的上采样方法,在处理多类不平衡数据时有更好的分类效果。
中图分类号:
[1]HE H,GARCIA E A.Learning from Imbalanced Data [J]. IEEE Transactions on Knowledge & Data Engineering,2009,21(9):1263-1284. [2]KRAWCZYK,BARTOSZ.Learning from imbalanced data:open challenges and future directions [J].Progress in Artificial Intelligence,2016,5(4):221-232. [3]LI Y X,CHAI Y,HU Y Q,et al.Review of imbalanced data classification methods[J].Control and Decision,2019,34(4):673-688. [4]ZHAO N,ZHANG X F,ZHANG L J.Overview of Imbalanced Data Classification[J].Computer Science,2018,45(S1):22-27,57. [5]LI Y,LIU Z D,ZHANG H J.Review on ensemble algorithms for imbalanced data classification[J].Application Research of Computers,2014,31(5):1287-1291. [6]GUO H X,LI Y J,JENNIFER S,et al.Learning from class-imbalanced data:Review of methods and applications [J].Expert Systems with Applications,2017,73:220-239. [7]MIAO Z M,ZHAO L W,TIAN S W,et al.Class Imbalance Learning for Identifying NLOS in UWB Positioning[J].Journal of Signal Processing,2016,32(1):8-13. [8]XIA P P,ZHANG L.Application of Imbalanced Data Learning Algorithms to Similarity Learning[J].Pattern Recognition and Artificial Intelligence | Patt Recog Artif Intell,2014,27(12):1138-1145. [9]WEI W W,LI J J,GAO L B.Effective detection of sophisticated online banking fraud on extremely;imbalanced data[J].World Wide Web-internet & Web Information Systems,2013,16(4):449-475. [10]CHAWLA N V,BOWYER K W,HALL L O,et al.SMOTE:Synthetic Minority Over-sampling Technique[J].Journal of Artificial Intelligence Research,2002,16(1):321-357. [11]HE H,BAI Y,GARCIA E A,et al.ADASYN:Adaptive Synthetic Sampling Approach for Imbalanced Learning[C]∥IEEE International Joint Conference on Neural Networks,2008(IJCNN 2008).IEEE,2008:1322-1328. [12]BARUA S,ISLAM M M,YAO X,et al.MWMOTE-Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning[J].IEEE Transactions on Knowledge and Data Engineering,2014,26(2):405-425. [13]NEKOOEIMEHR I,LAI-YUEN S K.Adaptive semi-unsuper- vised weighted oversampling (A-SUWO) for imbalanced datasets[J].Expert Systems with Applications,2016,46:405-416. [14]PUNTUMAPON K,RAKTHAMAMON T,WAIYAMAI K. Clusterbased minority over-sampling for imbalanced datasets[J].IEICE TRANSACTIONS on Information and Systems,2016,99(12):3101-3109. [15]HAN H,WANG W Y,MAO B H.Borderline-SMOTE:A New Over-Sampling Method in Imbalanced Data Sets Learning[M]∥Advances in Intelligent Computing.Springer Berlin Heidelberg,2005:878-887. [16]ANAND R,MEHROTRA K,MOHAN C K,et al.Efficient classification for multiclass problems using modular neural networks[J].IEEE Transactions on Neural Networks,1995,6(1):117-124. [17]ZHU T,LIN Y,LIU Y.Synthetic minority oversampling technique for multiclass imbalance problems [J].Pattern Recognition,2017,72:327-340. [18]ABDI L,HASHEMI S.To combat multi-class imbalanced problems by means of over-sampling techniques[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(1):238-251. [19]YANG X,KUANG Q,ZHANG W,et al.AMDO:An Over-Sampling Technique for Multi-Class Imbalanced Problems[J].IEEE Transactions on Knowledge & Data Engineering,2018,30(9):1672-1685. [20]CIESLAK D A,CHAWLA N V.Learning Decision Trees for Unbalanced Data[C]∥Joint European Conference on Machine Learning and Knowledge Discovery in Databases.Berlin:Sprin-ger,2008:241-256. [21]CIESLAK D A,HOENS T R,CHAWLA N V,et al.Hellinger distance decision trees are robust and skew-insensitive[J].Data Mining and Knowledge Discovery,2012,24(1):136-158. [22]UCI.Machine Learning Repository[OL].http://mlr.cs.uma-ss.edu/ml/datasets.html. [23]KEEL Dataset[OL].https://sci2s.ugr.es/keel/category.php?cat=clas&order=name#sub2. [24]HAND D J,TILL R J.A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems[J].Machine Learning,2001,45(2):171-186. [25]COHEN W W.Fast Effective Rule Induction[C]∥Twelfth International Conference on International Conference on Machine Learning.Elsevier,1995:115-123. |
[1] | 陈志强, 韩萌, 李慕航, 武红鑫, 张喜龙. 数据流概念漂移处理方法研究综述 Survey of Concept Drift Handling Methods in Data Streams 计算机科学, 2022, 49(9): 14-32. https://doi.org/10.11896/jsjkx.210700112 |
[2] | 周旭, 钱胜胜, 李章明, 方全, 徐常胜. 基于对偶变分多模态注意力网络的不完备社会事件分类方法 Dual Variational Multi-modal Attention Network for Incomplete Social Event Classification 计算机科学, 2022, 49(9): 132-138. https://doi.org/10.11896/jsjkx.220600022 |
[3] | 郝志荣, 陈龙, 黄嘉成. 面向文本分类的类别区分式通用对抗攻击方法 Class Discriminative Universal Adversarial Attack for Text Classification 计算机科学, 2022, 49(8): 323-329. https://doi.org/10.11896/jsjkx.220200077 |
[4] | 檀莹莹, 王俊丽, 张超波. 基于图卷积神经网络的文本分类方法研究综述 Review of Text Classification Methods Based on Graph Convolutional Network 计算机科学, 2022, 49(8): 205-216. https://doi.org/10.11896/jsjkx.210800064 |
[5] | 闫佳丹, 贾彩燕. 基于双图神经网络信息融合的文本分类方法 Text Classification Method Based on Information Fusion of Dual-graph Neural Network 计算机科学, 2022, 49(8): 230-236. https://doi.org/10.11896/jsjkx.210600042 |
[6] | 武红鑫, 韩萌, 陈志强, 张喜龙, 李慕航. 监督和半监督学习下的多标签分类综述 Survey of Multi-label Classification Based on Supervised and Semi-supervised Learning 计算机科学, 2022, 49(8): 12-25. https://doi.org/10.11896/jsjkx.210700111 |
[7] | 高振卓, 王志海, 刘海洋. 嵌入典型时间序列特征的随机Shapelet森林算法 Random Shapelet Forest Algorithm Embedded with Canonical Time Series Features 计算机科学, 2022, 49(7): 40-49. https://doi.org/10.11896/jsjkx.210700226 |
[8] | 杨炳新, 郭艳蓉, 郝世杰, 洪日昌. 基于数据增广和模型集成策略的图神经网络在抑郁症识别上的应用 Application of Graph Neural Network Based on Data Augmentation and Model Ensemble in Depression Recognition 计算机科学, 2022, 49(7): 57-63. https://doi.org/10.11896/jsjkx.210800070 |
[9] | 张洪博, 董力嘉, 潘玉彪, 萧宗志, 张惠臻, 杜吉祥. 视频理解中的动作质量评估方法综述 Survey on Action Quality Assessment Methods in Video Understanding 计算机科学, 2022, 49(7): 79-88. https://doi.org/10.11896/jsjkx.210600028 |
[10] | 邵欣欣. TI-FastText自动商品分类算法 TI-FastText Automatic Goods Classification Algorithm 计算机科学, 2022, 49(6A): 206-210. https://doi.org/10.11896/jsjkx.210500089 |
[11] | 陈景年. 一种适于多分类问题的支持向量机加速方法 Acceleration of SVM for Multi-class Classification 计算机科学, 2022, 49(6A): 297-300. https://doi.org/10.11896/jsjkx.210400149 |
[12] | 杨健楠, 张帆. 一种结合双注意力机制和层次网络结构的细碎农作物分类方法 Classification Method for Small Crops Combining Dual Attention Mechanisms and Hierarchical Network Structure 计算机科学, 2022, 49(6A): 353-357. https://doi.org/10.11896/jsjkx.210200169 |
[13] | 杨涵, 万游, 蔡洁萱, 方铭宇, 吴卓超, 金扬, 钱伟行. 基于步态分类辅助的虚拟IMU的行人导航方法 Pedestrian Navigation Method Based on Virtual Inertial Measurement Unit Assisted by GaitClassification 计算机科学, 2022, 49(6A): 759-763. https://doi.org/10.11896/jsjkx.211200148 |
[14] | 黄璞, 沈阳阳, 杜旭然, 杨章静. 基于局部约束特征线表示的人脸识别 Face Recognition Based on Locality Constrained Feature Line Representation 计算机科学, 2022, 49(6A): 429-433. https://doi.org/10.11896/jsjkx.210300169 |
[15] | 庞兴龙, 朱国胜. 基于半监督学习的网络流量分析研究 Survey of Network Traffic Analysis Based on Semi Supervised Learning 计算机科学, 2022, 49(6A): 544-554. https://doi.org/10.11896/jsjkx.210600131 |
|