Computer Science ›› 2019, Vol. 46 ›› Issue (12): 83-88. doi: 10.11896/jsjkx.190400053

• Big Data & Data Science •

Bi-directional Oversampling Method Based on Sample Stratification

ZHOU Xiao-min, CAO Fu-yuan, YU Li-qin

  1. (School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China);
    (Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006, China)
  • Received: 2019-04-09 Online: 2019-12-15 Published: 2019-12-17
  • Corresponding author: CAO Fu-yuan (born 1974), male, Ph.D., professor, Ph.D. supervisor, CCF member. His main research interests are data mining and machine learning. E-mail: cfy@sxu.edu.cn.
  • About the authors: ZHOU Xiao-min (born 1995), female, master's student. Her main research interest is classification learning on imbalanced data. E-mail: 2641401859@qq.com. YU Li-qin (born 1992), female, Ph.D. student. Her main research interests are data mining and machine learning.
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61573229), the Key Research and Development Program of Shanxi Province (201803D31022), the Shanxi Province Study-Abroad Fund Project (2016-003), and the Shanxi Province Study-Abroad Fund Merit-based Funding Project (2016-001).

Abstract: Because resampling is simple and intuitive, it has become an important approach to classifying imbalanced data. On small data sets, however, under-sampling may discard important information, so research on imbalanced data classification has focused on oversampling. Existing oversampling methods handle between-class imbalance effectively, but they may make dense regions of the minority class even denser and can cause samples to overlap. In addition, because the minority class may contain noise, existing methods may generate new samples around noisy points, which makes the minority-class distribution more disordered. To address these problems, this paper proposes a bi-directional oversampling method based on sample stratification. The method first divides the minority samples into a dense area and a sparse area based on the highest-density point and the intra-class average distance. It then performs bi-directional oversampling on the samples in the boundary region of the dense area and on the samples in the sparse area. To verify its effectiveness, the proposed algorithm was compared with other oversampling algorithms on 9 UCI data sets. The experimental results and statistical tests such as the Friedman test show that the proposed algorithm has certain advantages for imbalanced data classification.

Key words: Bi-directional oversampling, Classification, Dense area, Imbalanced data, Sparse area

CLC Number:

  • TP311
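
The abstract gives only the outline of the method: split the minority class into a dense area and a sparse area using the highest-density point and the intra-class average distance, then oversample the dense-area boundary and the sparse area. The sketch below is one possible reading of that outline, not the authors' algorithm. The density definition, the boundary definition, the interpretation of "bi-directional", and the names `stratify` and `oversample` are all assumptions made for illustration.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stratify(minority):
    """Return (dense, sparse) index lists for the minority samples.

    ASSUMPTIONS (not from the paper): density is the number of other samples
    within the intra-class average pairwise distance, and the dense area is
    every sample within that distance of the highest-density point.
    """
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    dist = [[euclid(p, q) for q in minority] for p in minority]
    avg = sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= avg) for i in range(n)]
    peak = max(range(n), key=density.__getitem__)  # highest-density point
    dense = [i for i in range(n) if dist[peak][i] <= avg]
    sparse = [i for i in range(n) if dist[peak][i] > avg]
    return dense, sparse


def oversample(minority, n_new, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation.

    ASSUMPTION: "bi-directional" is read here as alternating between
    sparse -> nearest dense sample and dense boundary -> nearest sparse sample.
    """
    rng = random.Random(seed)
    dense, sparse = stratify(minority)

    def nearest(i, candidates):
        return min(candidates, key=lambda c: euclid(minority[i], minority[c]))

    # Hypothetical boundary definition: dense samples closest to some sparse sample.
    boundary = sorted({nearest(s, dense) for s in sparse})
    synthetic = []
    for k in range(n_new):
        if not sparse:
            # No sparse area: fall back to interpolating inside the dense area.
            i, j = rng.choice(dense), rng.choice(dense)
        elif k % 2 == 0:
            # Inward direction: from a sparse sample toward the dense area.
            i = rng.choice(sparse)
            j = nearest(i, dense)
        else:
            # Outward direction: from a dense boundary sample toward the sparse area.
            i = rng.choice(boundary)
            j = nearest(i, sparse)
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic
```

Calling `oversample(minority, n_new)` with a list of equal-length numeric lists returns `n_new` synthetic points. Because every new point lies on a segment between a sparse sample and a dense sample, the sketch never adds points inside the already dense core, which matches the motivation stated in the abstract. It does not attempt the noise handling the abstract alludes to.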