Computer Science, 2019, Vol. 46, Issue (12): 83-88. doi: 10.11896/jsjkx.190400053
周晓敏, 曹付元, 余丽琴
ZHOU Xiao-min, CAO Fu-yuan, YU Li-qin
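Abstract: Because resampling is simple and intuitive, it has become an important approach to classifying imbalanced data. When the data set is small, however, under-sampling may discard important information, so over-sampling is the focus of research on imbalanced data classification. Existing over-sampling methods effectively address between-class imbalance, but they can make already dense regions of the minority class denser still and may even cause samples to overlap. In addition, because minority-class samples may contain noise, existing methods may generate new samples around noisy points, which makes the distribution of the minority class more disordered. To address these problems, this paper proposes a bidirectional over-sampling method based on sample stratification. The method first divides the minority-class samples into a dense layer and a sparse layer, based on the highest-density point and the average intra-class distance, and then performs bidirectional over-sampling on the samples in the boundary region of the dense layer and on the samples in the sparse layer. To verify the effectiveness of the proposed algorithm, it is compared with other over-sampling algorithms on 9 UCI data sets. The experimental results and the results of the Friedman test and related tests show that the proposed algorithm has certain advantages in handling imbalanced data classification.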
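The stratification step described in the abstract can be illustrated with a rough sketch. The abstract does not give the density estimate, the layer threshold, the boundary-region criterion, or the bidirectional sampling rule, so everything concrete here is an assumption: density is taken as the inverse of the mean distance to the `k` nearest neighbours, a sample counts as dense when it lies within the average pairwise distance of the highest-density point, and the hypothetical helpers `stratify_minority` and `interpolate` (the latter a plain SMOTE-style interpolation standing in for the paper's sampling rule) are named only for illustration.

```python
import numpy as np


def stratify_minority(X, k=3):
    """Split minority samples into a dense layer and a sparse layer.

    Assumptions (not taken from the abstract): density is the inverse of the
    mean distance to the k nearest neighbours, and a sample is "dense" when it
    lies within the average pairwise distance of the highest-density point.
    Requires at least k + 1 samples.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between all minority samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Mean distance to the k nearest neighbours (column 0 is the sample itself).
    knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    # Highest density corresponds to the smallest mean k-NN distance.
    peak = int(np.argmin(knn_mean))
    # Intra-class average distance over all distinct pairs.
    avg_dist = dist[np.triu_indices(n, k=1)].mean()
    # Boolean mask: True for dense-layer samples, False for sparse-layer ones.
    dense = dist[peak] <= avg_dist
    return peak, avg_dist, dense


def interpolate(X, idx_a, idx_b, rng):
    """Generic SMOTE-style synthetic sample on the segment between two samples."""
    X = np.asarray(X, dtype=float)
    gap = rng.random()  # uniform in [0, 1)
    return X[idx_a] + gap * (X[idx_b] - X[idx_a])
```

Under these assumptions, an isolated sample far from the densest region falls into the sparse layer, so a method that treats the layers differently has a handle for not reinforcing already crowded areas. How the paper selects dense-layer boundary samples and what "bidirectional" means in detail would have to be taken from the full text.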