Bi-directional Oversampling Method Based on Sample Stratification

doi:10.11896/jsjkx.190400053

Abstract: Resampling has become an important approach to classifying imbalanced data because it is simple and intuitive. On small data sets, however, under-sampling may discard important information, so oversampling is the main focus of imbalanced data classification. Existing oversampling methods effectively reduce the imbalance between classes, but they can make already dense areas of the minority class denser and may even cause samples to overlap. In addition, when the minority class contains noise, these methods may generate new samples around the noisy points, which makes the minority-class distribution more confused. To address these problems, this paper proposes a bi-directional oversampling method based on sample stratification. The method first divides the minority samples into a dense area and a sparse area, based on the highest-density point and the intra-class average distance. Bi-directional oversampling is then performed in the boundary region of the dense area and the sparse area. To verify the effectiveness of the proposed algorithm, comprehensive experiments were conducted on 9 data sets from the UCI database. The experimental results and the Friedman test results show the superiority of the proposed algorithm for imbalanced data classification.
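The abstract gives only the outline of the method, so the following is a minimal illustrative sketch and not the authors' algorithm. Everything beyond "highest density point" and "intra-class average distance" is an assumption made here for illustration: density is taken as the number of minority samples within the average distance, the dense area as the samples within that distance of the peak, the boundary region as the half of the dense area farthest from the peak, and "bi-directional" as alternating interpolation from boundary toward sparse samples and from sparse toward boundary samples. Noise filtering is not modelled. The function names `stratify` and `bidirectional_oversample` are hypothetical.

```python
import math
import random


def stratify(minority):
    """Split minority samples into dense and sparse areas (assumed rule)."""
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    # Intra-class average distance: mean over all distinct pairs.
    total = sum(math.dist(minority[i], minority[j])
                for i in range(n) for j in range(i + 1, n))
    avg_dist = total / (n * (n - 1) / 2)

    # Assumed density: number of other samples within avg_dist.
    def density(i):
        return sum(1 for j in range(n)
                   if j != i and math.dist(minority[i], minority[j]) <= avg_dist)

    peak = minority[max(range(n), key=density)]  # highest-density point
    dense = [p for p in minority if math.dist(p, peak) <= avg_dist]
    sparse = [p for p in minority if math.dist(p, peak) > avg_dist]
    return dense, sparse, peak


def interpolate(a, b, rng):
    """SMOTE-style random point on the segment from a to b."""
    t = rng.random()
    return tuple(x + t * (y - x) for x, y in zip(a, b))


def bidirectional_oversample(minority, n_new, seed=0):
    """Generate n_new synthetic samples between boundary and sparse areas."""
    rng = random.Random(seed)
    dense, sparse, peak = stratify(minority)
    if not sparse:
        return []  # nothing to balance under this illustrative rule
    # Assumed boundary region: the half of the dense area farthest from the peak.
    ranked = sorted(dense, key=lambda p: math.dist(p, peak), reverse=True)
    boundary = ranked[:max(1, len(ranked) // 2)]
    synthetic = []
    for k in range(n_new):
        if k % 2 == 0:
            # Direction 1: from a boundary sample toward its nearest sparse sample.
            src = rng.choice(boundary)
            dst = min(sparse, key=lambda p: math.dist(p, src))
        else:
            # Direction 2: from a sparse sample toward its nearest boundary sample.
            src = rng.choice(sparse)
            dst = min(boundary, key=lambda p: math.dist(p, src))
        synthetic.append(interpolate(src, dst, rng))
    return synthetic


if __name__ == "__main__":
    # Toy minority class: a tight cluster plus two outlying samples.
    minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
                (0.05, 0.05), (2.0, 2.0), (2.2, 2.1)]
    dense, sparse, peak = stratify(minority)
    print(len(dense), len(sparse), peak)
    print(bidirectional_oversample(minority, n_new=4))
```

Under these assumptions no new points are placed inside the core of the dense area, which is one way to avoid the "dense areas become denser" effect the abstract describes. The full paper should be consulted for the actual stratification and sampling rules.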

Key words: Bi-directional oversampling, Classification, Dense area, Imbalanced data, Sparse area

CLC Number: TP311

ZHOU Xiao-min, CAO Fu-yuan, YU Li-qin. Bi-directional Oversampling Method Based on Sample Stratification [J]. Computer Science, 2019, 46(12): 83-88.

