Computer Science, 2020, Vol. 47, Issue (11): 88-94. doi: 10.11896/jsjkx.191000102

• Database & Big Data & Data Science •

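Mixed-sampling Method for Imbalanced Data Based on Quantum Evolutionary Algorithm

YANG Hao1, CHEN Hong-mei2

  1. 1 Key Laboratory of Cloud Computing and Intelligent Technology, Southwest Jiaotong University, Chengdu 611756
    2 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756
  • Received: 2019-10-16 Revised: 2020-03-29 Online: 2020-11-15 Published: 2020-11-05
  • Corresponding author: CHEN Hong-mei (hmchen@swjtu.edu.cn)
  • About the author: apologise@my.swjtu.edu.cn
  • Funding:
    National Natural Science Foundation of China (61572406, 61976182); Key Program for International S&T Cooperation of Sichuan Province (2019YFH0097)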

Mixed-sampling Method for Imbalanced Data Based on Quantum Evolutionary Algorithm

YANG Hao1, CHEN Hong-mei2

  1. 1 Key Laboratory of Cloud Computing and Intelligent Technology, Southwest Jiaotong University, Chengdu 611756, China
    2 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
  • Received: 2019-10-16 Revised: 2020-03-29 Online: 2020-11-15 Published: 2020-11-05
  • About author: YANG Hao, born in 1995, postgraduate, is a member of China Computer Federation. His main research interests include database technology and data mining.
    CHEN Hong-mei, born in 1971, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. Her main research interests include granular computing, rough sets and intelligent information processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61572406, 61976182) and the Key Program for International S&T Cooperation of Sichuan Province (2019YFH0097).

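Abstract (translated from the Chinese): Under-sampling and over-sampling are common approaches to classifying imbalanced data. Existing solutions to imbalanced data distributions mostly rely on a single sampling method, which may cause overfitting or the loss of important samples. To address this problem, a mixed-sampling method based on the quantum evolutionary algorithm, MSQEA (Mixed-Sampling method based on Quantum Evolutionary Algorithm), is proposed. The method encodes the majority-class and minority-class samples separately to form the population of individuals in the quantum evolutionary algorithm, and then obtains a suitable candidate sampling subset through iteration. On the candidate subset, under-sampling is applied first to remove majority-class samples, which prevents the subsequent over-sampling step from synthesizing too many redundant minority-class samples; over-sampling is then applied to the minority-class samples to obtain a balanced dataset. In addition, to evaluate the fitness of quantum individuals effectively, a clustering algorithm is used to cluster the original dataset and construct an effective validation set for evaluating individuals. To verify the performance of MSQEA, classifiers such as SMO, J48 and NB are used on imbalanced datasets downloaded from the KEEL website to test the classification performance obtained after processing by different sampling algorithms. Experimental results show that MSQEA achieves better classification performance on multiple classifiers than current well-performing sampling algorithms.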

Keywords: Imbalanced data, Classification, Mixed sampling, Quantum evolutionary algorithm

Abstract: Under-sampling and over-sampling are common methods for solving the classification problem on imbalanced data. This paper focuses on the overfitting and the loss of valuable samples that can result from using a single sampling method. A mixed-sampling method based on the quantum evolutionary algorithm, namely MSQEA, is proposed. In MSQEA, the majority-class and minority-class samples are first encoded separately to form the individuals of the population in the quantum evolutionary algorithm, and an appropriate candidate sampling subset is then obtained through optimization iterations. After that, majority samples in the candidate subset are removed by under-sampling, so that the subsequent over-sampling step does not generate too many redundant samples. Then, an over-sampling method is used to generate minority samples. Additionally, to evaluate the fitness of quantum individuals effectively, a clustering technique is used to cluster the dataset and build effective validation sets for evaluating individuals. Experiments are conducted to evaluate the performance of MSQEA. Imbalanced datasets are downloaded from the KEEL website, and SMO, J48 and NB are used as classifiers to compare classification performance after preprocessing by different sampling methods. Experimental results show that the classification performance of MSQEA is better than that of some state-of-the-art sampling methods.

Key words: Classification, Imbalanced data, Mixed-sampling, Quantum evolutionary algorithm
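
The search loop outlined in the abstract can be illustrated with a minimal Python sketch of a quantum-inspired evolutionary algorithm that evolves a keep/drop mask over samples. It is an illustration under stated assumptions, not the MSQEA implementation: the angle encoding, the fixed rotation step `delta`, the function names (`observe`, `rotate`, `qea_select`) and the toy fitness are all hypothetical, and the clustering-based validation set and the over-sampling step are omitted.

```python
import math
import random


def observe(thetas, rng):
    """Collapse Q-bits to a 0/1 mask; bit i is 1 with probability sin^2(theta_i)."""
    return [1 if rng.random() < math.sin(t) ** 2 else 0 for t in thetas]


def rotate(thetas, bits, best_bits, delta):
    """Turn each angle by delta toward the best mask where the observed bit differs."""
    eps = 0.05  # keep angles off 0 and pi/2 so neither outcome becomes impossible
    updated = []
    for t, b, best in zip(thetas, bits, best_bits):
        if b != best:
            t += delta if best == 1 else -delta
        updated.append(min(max(t, eps), math.pi / 2 - eps))
    return updated


def qea_select(n_samples, fitness, pop_size=10, generations=50,
               delta=0.05 * math.pi, seed=0):
    """Return (best_mask, best_fitness) found by the quantum-inspired search."""
    rng = random.Random(seed)
    # Every Q-bit starts at pi/4, i.e. a 50/50 chance of keeping each sample.
    population = [[math.pi / 4] * n_samples for _ in range(pop_size)]
    best_mask, best_fit = None, -math.inf
    for _ in range(generations):
        observed = [observe(ind, rng) for ind in population]
        for mask in observed:
            fit = fitness(mask)
            if fit > best_fit:
                best_mask, best_fit = mask, fit
        # Pull every individual toward the best mask seen so far.
        population = [rotate(ind, mask, best_mask, delta)
                      for ind, mask in zip(population, observed)]
    return best_mask, best_fit


def toy_fitness(mask):
    """Hypothetical stand-in fitness: prefer masks that keep exactly 5 samples."""
    return -abs(sum(mask) - 5)


best_mask, best_fit = qea_select(20, toy_fitness)
```

In a realistic setting, `fitness` would train a classifier on the samples selected by the mask and score it on a held-out validation set; the toy function here only rewards masks that keep a target number of samples.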

CLC Number:

  • TP391