Computer Science, 2024, Vol. 51, Issue (6A): 230400198-6. doi: 10.11896/jsjkx.230400198
郑一凡, 王卯宁
ZHENG Yifan, WANG Maoning
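Abstract: Resampling is an important approach to classifying imbalanced data. When the dataset is small, however, undersampling discards important information, so oversampling has become the main focus of research on imbalanced classification. Existing oversampling methods partly alleviate between-class imbalance, but they do not fundamentally introduce additional information into the minority class, and they still carry a risk of overfitting. To address these problems, this paper proposes a minority-class synthesis method based on majority-class variance transfer, Variance Transfer Oversampling (VTO). VTO extracts sample offset vectors from the sufficiently diverse majority class, adjusts them by combining the feature weight matrices of the minority and majority classes, and finally adds the offset vectors that pass a confidence condition to the minority-class sample center. This introduces majority-class variance into minority sample generation and thereby enriches the minority-class feature space. To verify the effectiveness of the proposed algorithm, a decision tree is trained as the classification model on six KEEL datasets, and VTO is compared with SMOTEENN and other oversampling methods, using F-score and PR-AUC as evaluation metrics. The results show that the proposed algorithm has an advantage over the compared methods in handling imbalanced data classification.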
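The pipeline described in the abstract (offset extraction, weighting, confidence filtering, addition to the minority center) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not define the feature weight matrices or the confidence condition, so the per-feature standard-deviation ratio used as a weight, the nearest-centroid filter, and the function name `vto_sketch` are all assumptions made here for the sake of a runnable example.

```python
import numpy as np


def vto_sketch(X_min, X_maj, n_new, seed=0):
    """Illustrative variance-transfer oversampling; returns synthetic minority rows."""
    rng = np.random.default_rng(seed)
    c_min = X_min.mean(axis=0)  # minority-class sample center
    c_maj = X_maj.mean(axis=0)  # majority-class sample center

    # Offset vectors: deviation of each majority sample from the majority center.
    offsets = X_maj - c_maj

    # ASSUMPTION: per-feature weight = minority std / majority std, clipped to [0, 1].
    # The paper combines feature weight matrices of both classes; its exact
    # definition is not given in the abstract.
    eps = 1e-12
    w = np.clip(X_min.std(axis=0) / (X_maj.std(axis=0) + eps), 0.0, 1.0)

    # Add the weighted offsets to the minority center to form candidates.
    candidates = c_min + offsets * w

    # ASSUMPTION: confidence condition = candidate lies closer to the minority
    # center than to the majority center (a stand-in for the paper's criterion).
    d_min = np.linalg.norm(candidates - c_min, axis=1)
    d_maj = np.linalg.norm(candidates - c_maj, axis=1)
    kept = candidates[d_min < d_maj]
    if len(kept) == 0:
        return np.empty((0, X_min.shape[1]))

    # Draw n_new rows, sampling with replacement only if too few candidates remain.
    idx = rng.choice(len(kept), size=n_new, replace=len(kept) < n_new)
    return kept[idx]
```

Under these assumptions, the synthetic rows inherit the shape of the majority-class spread rather than interpolating between existing minority samples, which is the contrast with SMOTE-style methods that the abstract draws. The returned rows would be appended to the minority class before training the classifier.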