Computer Science, 2020, Vol. 47, Issue (11): 88-94. doi: 10.11896/jsjkx.191000102
杨浩1, 陈红梅2
YANG Hao1, CHEN Hong-mei2
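Abstract: Undersampling and oversampling are common approaches to classifying imbalanced data. Most current methods for handling imbalanced class distributions rely on a single sampling strategy, which can cause overfitting or discard important samples. To address this problem, a mixed-sampling method based on a quantum evolutionary algorithm, MSQEA (Mixed-Sampling method based on Quantum Evolutionary Algorithm), is proposed. The method encodes majority-class and minority-class samples separately to form the population of individuals in the quantum evolutionary algorithm, then iterates to obtain a suitable candidate sampling subset. On the candidate subset, undersampling is applied first to remove majority-class samples, which keeps the subsequent oversampling step from synthesizing too many redundant minority-class samples; oversampling is then applied to the minority-class samples to produce a balanced dataset. To evaluate the fitness of quantum individuals effectively, a clustering algorithm is run on the original dataset to build a validation set for evaluating individuals. To verify the performance of MSQEA, experiments on imbalanced datasets downloaded from the KEEL website use SMO, J48, NB and other classifiers to test classification performance after processing by different sampling algorithms. The experimental results show that MSQEA achieves better classification performance across multiple classifiers than current well-performing sampling algorithms.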
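The pipeline described in the abstract (encode both classes as quantum individuals, iterate to find a candidate subset, undersample the majority class, then oversample the minority class) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the angle encoding of Q-bits, the rotation rule in `rotate`, the SMOTE-like interpolation in `mixed_sample`, the parameter defaults, and the caller-supplied `fitness` function, which stands in for the paper's evaluation on a clustering-based validation set.

```python
import math
import random


def observe(thetas, rng):
    """Collapse Q-bit angles into a binary mask; P(bit = 1) = sin(theta)^2."""
    return [1 if rng.random() < math.sin(t) ** 2 else 0 for t in thetas]


def rotate(thetas, mask, best_mask, delta=0.05 * math.pi):
    """Assumed rotation rule: nudge each differing Q-bit toward the best mask."""
    updated = []
    for theta, bit, best_bit in zip(thetas, mask, best_mask):
        if bit != best_bit:
            theta += delta if best_bit == 1 else -delta
        # Keep angles away from 0 and pi/2 so every bit can still flip.
        updated.append(min(max(theta, 0.01), math.pi / 2 - 0.01))
    return updated


def mixed_sample(majority, minority, mask, rng):
    """Undersample the majority class first, then oversample the minority class."""
    n_maj = len(majority)
    kept_majority = [s for s, bit in zip(majority, mask[:n_maj]) if bit]
    # Selected minority samples act as seeds; fall back to all of them if none.
    seeds = [s for s, bit in zip(minority, mask[n_maj:]) if bit] or list(minority)
    new_minority = list(minority)
    while len(new_minority) < len(kept_majority):
        a, b = rng.choice(seeds), rng.choice(seeds)
        lam = rng.random()
        # SMOTE-like synthetic point on the segment between two seeds.
        new_minority.append([x + lam * (y - x) for x, y in zip(a, b)])
    return kept_majority, new_minority


def msqea_sketch(majority, minority, fitness, pop_size=6, generations=20, seed=0):
    """Evolve sample-selection masks and return a resampled (majority, minority)."""
    rng = random.Random(seed)
    n_bits = len(majority) + len(minority)
    # Each individual starts with every Q-bit at pi/4, i.e. P(bit = 1) = 0.5.
    population = [[math.pi / 4] * n_bits for _ in range(pop_size)]
    best_mask, best_fit = None, -math.inf
    for _ in range(generations):
        masks = [observe(individual, rng) for individual in population]
        for mask in masks:
            score = fitness(*mixed_sample(majority, minority, mask, rng))
            if score > best_fit:
                best_mask, best_fit = mask, score
        population = [rotate(ind, m, best_mask) for ind, m in zip(population, masks)]
    return mixed_sample(majority, minority, best_mask, rng)
```

In this sketch, a call such as `msqea_sketch(majority, minority, fitness)` takes two lists of feature vectors and any function that scores a resampled pair of lists; in practice the score would come from training a classifier on the resampled data and measuring it on a held-out validation set. Removing majority samples before oversampling means the interpolation loop only has to generate enough synthetic points to match the reduced majority class.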