Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230400198-6. doi: 10.11896/jsjkx.230400198

• Big Data & Data Science •

Imbalanced Data Oversampling Method Based on Variance Transfer

ZHENG Yifan, WANG Maoning

  1. School of Information, Central University of Finance and Economics, Beijing 102206, China
  • Published: 2024-06-06
  • Corresponding author: WANG Maoning (13854139297@139.com)
  • About author: ZHENG Yifan (zhengyf_cufe@163.com), born in 2000, postgraduate. Her main research interests include fraud detection and imbalanced data processing.
    WANG Maoning, born in 1987, Ph.D, professor, is a member of the CCF (No.93508M). Her main research interests include cryptography, blockchain and digital currency.
  • Supported by:
    National Natural Science Foundation of China (61907042, 61702570), Beijing Natural Science Foundation (4194090) and Project of Research Center for Science and Technology Finance and Entrepreneurship Finance, Key Research Base of Humanities and Social Sciences, Sichuan Provincial Department of Education (JR2018-2).

Abstract: Resampling is an important approach to the imbalanced data classification problem. When the data set is small, however, undersampling discards important information, so oversampling has become the research focus for imbalanced data classification. Existing oversampling methods partially solve the between-class imbalance problem, but they essentially introduce no additional information into the minority class and still carry a risk of overfitting. To address these problems, this paper proposes Variance Transfer Oversampling (VTO), a minority-class synthesis method based on transferring variance from the majority class. VTO extracts sample shift vectors from the sufficiently diverse majority class, adjusts them using the feature weight matrices of the minority and majority classes, and superimposes the shift vectors that pass a confidence-condition filter onto the center of the minority class samples. The majority-class variance is thereby introduced into the generation of new minority samples, which enriches the minority-class feature space. To verify the effectiveness of the proposed algorithm, a decision tree is trained as the classification model on 6 KEEL data sets, and VTO is compared with SMOTEENN and other oversampling methods using F-score and PR-AUC as evaluation metrics. The results show that VTO is more advantageous in handling imbalanced data classification.

Key words: Imbalanced data, Classification, Oversampling, Variance transfer, Covariance
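
The pipeline outlined in the abstract (extract shift vectors from the majority class, reweight them, filter them, then add them to the minority center) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the weight matrix or the confidence condition, so the per-feature standard-deviation ratio `w` and the "closer to the minority center than to the majority center" filter below are placeholder assumptions, and the function name `vto_sketch` is hypothetical. The sketch uses only the Python standard library.

```python
import math
import random
from statistics import fmean, pstdev


def vto_sketch(x_min, x_maj, n_new, seed=0):
    """Illustrative variance-transfer oversampling (not the paper's exact method)."""
    rng = random.Random(seed)
    dims = range(len(x_min[0]))

    # Class centers: per-feature means of each class.
    c_min = [fmean(row[j] for row in x_min) for j in dims]
    c_maj = [fmean(row[j] for row in x_maj) for j in dims]

    # ASSUMPTION: a diagonal weight built from the per-feature ratio of
    # minority to majority standard deviation stands in for the paper's
    # feature weight matrix, which the abstract does not define.
    eps = 1e-12
    w = [
        (pstdev([row[j] for row in x_min]) + eps)
        / (pstdev([row[j] for row in x_maj]) + eps)
        for j in dims
    ]

    candidates = []
    for row in x_maj:
        # Shift vector: offset of a majority sample from the majority center,
        # reweighted and superimposed on the minority center.
        cand = [c_min[j] + w[j] * (row[j] - c_maj[j]) for j in dims]
        # ASSUMPTION: placeholder confidence condition; keep a candidate only
        # if it lies closer to the minority center than to the majority center.
        if math.dist(cand, c_min) < math.dist(cand, c_maj):
            candidates.append(cand)

    if not candidates:
        return []
    if n_new <= len(candidates):
        return rng.sample(candidates, n_new)   # draw without replacement
    return rng.choices(candidates, k=n_new)    # fall back to replacement
```

Calling `vto_sketch(x_min, x_maj, n_new)` with two lists of equal-length feature rows returns up to `n_new` synthetic minority rows; swapping in a different weighting or filter only requires changing the two lines marked as assumptions.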

CLC Number: 

  • TP311