计算机科学 ›› 2025, Vol. 52 ›› Issue (9): 220-231. doi: 10.11896/jsjkx.241000010

• 数据库&大数据&数据科学 •

基于无噪梯度分布的合成过采样方法

胡立彬1, 张云峰2, 刘培德3   

  1 山东财经大学管理科学与工程学院 济南 250014
    2 山东财经大学计算机科学与技术学院 济南 250014
    3 山东财经大学山东省区块链金融重点实验室 济南 250014
  • 收稿日期:2024-10-08 修回日期:2025-02-15 出版日期:2025-09-15 发布日期:2025-09-11
  • 通讯作者: 张云峰(yfzhang@sdufe.edu.cn)
  • 作者简介:(hlblydx@163.com)
  • 基金资助:
    山东省自然科学基金(ZR2022MF245);山东省重点研发计划(2023CXPT033)

Synthetic Oversampling Method Based on Noiseless Gradient Distribution

HU Libin1, ZHANG Yunfeng2, LIU Peide3   

  1 School of Management Science and Engineering,Shandong University of Finance and Economics,Jinan 250014,China
    2 School of Computer Science and Technology,Shandong University of Finance and Economics,Jinan 250014,China
    3 Shandong Key Laboratory of Blockchain Finance,Shandong University of Finance and Economics,Jinan 250014,China
  • Received:2024-10-08 Revised:2025-02-15 Online:2025-09-15 Published:2025-09-11
  • About author:HU Libin,born in 1990,Ph.D,is a member of CCF(No.V6549G).His main research interests include data mining,artificial intelligence and financial intelligence risk control.
    ZHANG Yunfeng,born in 1977,Ph.D,professor,Ph.D supervisor,is a member of CCF(No.19888M).His main research interests include graphics,artificial intelligence,data mining and visualization.
  • Supported by:
    Natural Science Foundation of Shandong Province(ZR2022MF245) and Key R&D Program of Shandong Province(2023CXPT033).

摘要: 合成过采样方法(Synthetic Oversampling Method)是解决不平衡分类问题的重要手段,但当前的合成过采样方法在处理高维不平衡分类问题时仍面临诸多挑战。针对当前合成过采样方法未考虑噪声样本造成的误差累积、对样本空间距离过度依赖、合成样本的分布牺牲负类样本识别精度这3个问题,提出一种基于无噪梯度分布的合成过采样方法。首先,利用样本的梯度贡献属性作为度量样本标签置信度的指标并过滤数据集中的噪声标签样本,避免了噪声样本作为根样本造成的误差累积。其次,根据梯度贡献指标和安全梯度阈值将正类样本分配到不同的梯度区间,并选择安全梯度区间内的样本作为根样本,根样本的梯度右近邻作为辅助样本,不仅摆脱了对空间距离度量的依赖,而且保证了决策边界不断往负类样本移动。最后,设计了基于余弦相似度的安全梯度分布近似策略,用于计算每个安全梯度区间内需要生成的样本数量,该策略合成后的样本分布可以使决策边界以安全的方式向负类样本移动,因此不会明显牺牲负类样本的识别精度。在来自KEEL,UCI和Kaggle平台的数据集上的实验表明,所提出的算法在提升分类器Recall值的同时,也可以获得很好的F1-Score,G-Mean和MCC值。

关键词: 梯度贡献, 无噪梯度, 梯度右近邻, 安全梯度分布近似, 合成过采样

Abstract: Synthetic oversampling is an important means of solving imbalanced classification problems,but current synthetic oversampling methods still face many challenges when dealing with high-dimensional imbalanced classification.A synthetic oversampling method based on noiseless gradient distribution is proposed to address three issues in existing methods:error accumulation caused by noise samples,excessive dependence on spatial distance between samples,and synthetic sample distributions that sacrifice the recognition accuracy of negative class samples.Firstly,the gradient contribution of each sample is used as a metric of label confidence,and noise-labeled samples in the data set are filtered out,which avoids the error accumulation caused by taking noise samples as root samples.Secondly,the positive samples are assigned to different gradient intervals according to the gradient contribution metric and the safe gradient threshold;the samples in the safe gradient intervals are selected as root samples,and the gradient right nearest neighbor of each root sample is taken as its auxiliary sample,which not only removes the dependence on spatial distance measurement but also ensures that the decision boundary keeps moving toward the negative class samples.Finally,a safe gradient distribution approximation strategy based on cosine similarity is designed to calculate the number of samples to be generated in each safe gradient interval;the resulting synthetic sample distribution moves the decision boundary toward the negative class samples in a safe way,so the recognition accuracy of the negative class samples is not significantly sacrificed.Experiments on datasets from the KEEL,UCI and Kaggle platforms show that the proposed algorithm not only improves the Recall of the classifier but also obtains satisfactory F1-Score,G-Mean and MCC values.
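The abstract describes the three steps only at a high level, so the following Python sketch is an illustrative reconstruction rather than the authors' implementation. It assumes a GHM-style gradient contribution |p - y| computed from a preliminary logistic-regression model, equal-width safe gradient intervals, a greedy cosine-similarity allocation of per-interval synthetic counts, and linear interpolation between a root sample and its gradient right neighbor; the function names and the thresholds noise_thr and safe_thr are hypothetical.

# Illustrative sketch only: the paper's exact definitions are not given in the
# abstract, so every threshold and helper below is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression


def gradient_contribution(p_pos, y):
    # Assumed GHM-style gradient norm of the cross-entropy loss, |p - y|:
    # small for a confidently labelled sample, close to 1 for likely label noise.
    return np.abs(p_pos - y)


def noiseless_gradient_oversample(X, y, noise_thr=0.9, safe_thr=0.5,
                                  n_bins=5, random_state=0):
    """Sketch of the three steps in the abstract; y == 1 marks the minority class."""
    rng = np.random.default_rng(random_state)
    # Preliminary classifier used only to obtain per-sample gradient contributions.
    proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    g = gradient_contribution(proba, y)

    pos = np.flatnonzero(y == 1)
    clean = pos[g[pos] < noise_thr]        # step 1: filter noisy positives
    safe = clean[g[clean] <= safe_thr]     # step 2: keep safe-gradient positives
    safe = safe[np.argsort(g[safe])]       # sort so the "right neighbour" is simply
                                           # the next positive sample by gradient value
    if len(safe) < 2:
        return X, y

    edges = np.linspace(0.0, safe_thr, n_bins + 1)
    bin_of = np.clip(np.digitize(g[safe], edges) - 1, 0, n_bins - 1)
    minority_hist = np.bincount(bin_of, minlength=n_bins).astype(float)
    neg_g = g[y == 0]
    target_hist = np.histogram(neg_g[neg_g <= safe_thr], bins=edges)[0] + 1.0

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Step 3 (assumed criterion): greedily place each synthetic sample in the safe
    # interval that makes the minority gradient histogram most cosine-similar to
    # the majority histogram over the same intervals.
    counts = np.zeros(n_bins, dtype=int)
    for _ in range(max(int(np.sum(y == 0) - np.sum(y == 1)), 0)):
        gains = [cosine(minority_hist + counts + np.eye(n_bins)[b], target_hist)
                 for b in range(n_bins)]
        counts[int(np.argmax(gains))] += 1

    # Synthesis: root = random safe sample of the chosen interval,
    # auxiliary = its gradient right neighbour; interpolate between the two.
    new_rows = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_of == b)
        if len(members) == 0:
            continue
        for _ in range(counts[b]):
            i = int(rng.choice(members))
            root, aux = safe[i], safe[min(i + 1, len(safe) - 1)]
            u = rng.random()
            new_rows.append(X[root] + u * (X[aux] - X[root]))

    if not new_rows:
        return X, y
    X_res = np.vstack([X, np.asarray(new_rows)])
    y_res = np.concatenate([y, np.ones(len(new_rows), dtype=y.dtype)])
    return X_res, y_res

Under these assumptions the method would be applied as X_res, y_res = noiseless_gradient_oversample(X, y) before fitting the downstream classifier; the greedy histogram matching is only one simple way to realize a safe gradient distribution approximation and could be replaced by the paper's exact allocation rule.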

Key words: Gradient contribution, Noiseless gradient, Gradient right neighbor, Safe gradient distribution approximation, Synthetic oversampling

中图分类号: TP181
[1]TIAN Y,BIAN B,TANG X F,et al.A new non-kernel quadratic surface approach for imbalanced data classification in online credit scoring[J].Information Sciences,2021,563:150-165.
[2]CHARIZANOS G,DEMIRHAN H,ICEN D.An online fuzzy fraud detection framework for credit card transactions[J].Expert Systems With Applications,2024,252(PA):124127.
[3]REN H J,TANG Y H,DONG W Y,et al.Dynamic ensemble handling class imbalance in network intrusion detection[J].Expert Systems With Applications,2023,229(PA):120420.
[4]WANG C J,XIN C,XU Z L.A novel deep metric learning model for imbalanced fault diagnosis and toward open-set classification[J].Knowledge-Based Systems,2021,220:106925.
[5]BARUA S,ISLAM M M,YAO X,et al.MWMOTE-Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning[J].IEEE Transactions on Knowledge and Data Engineering,2014,26(2):405-425.
[6]NEKOOEIMEHR I,LAI-YUEN S K.Adaptive semi-unsupervised weighted oversampling(A-SUWO) for imbalanced datasets[J].Expert Systems With Applications,2016,46:405-416.
[7]WANG X X,LI L X,LIN H.A Review of SMOTE Algorithm Research[J].Journal of Frontiers of Computer Science & Technology,2024,18(5):1135-1159.
[8]CHAWLA N V,BOWYER K W,HALL L O,et al.SMOTE:Synthetic Minority Over-sampling Technique[J].Journal of Artificial Intelligence Research,2002,16:321-357.
[9]HE H B,BAI Y,GARCIA E A,et al.ADASYN:Adaptive synthetic sampling approach for imbalanced learning[C]//2008 IEEE International Joint Conference on Neural Networks.IEEE World Congress on Computational Intelligence.2008:1322-1328.
[10]HAN H,WANG W Y,MAO B H.Borderline-SMOTE:A New Over-Sampling Method in Imbalanced Data Sets Learning[C]//Lecture Notes in Computer Science.2005:878-887.
[11]NGUYEN H M,COOPER E W,KAMEI K.Borderline over-sampling for imbalanced data classification[J].International Journal of Knowledge Engineering and Soft Data Paradigms,2011,3(1):4-21.
[12]BUNKHUMPORNPAT C,SINAPIROMSARAN K,LURSINSAP C.Safe-Level-SMOTE:Safe-Level-Synthetic Minority Over-Sampling Technique for Handling the Class Imbalanced Problem[C]//13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining.2009:475-482.
[13]ZHENG J,QU H C,LI Z N,et al.A novel autoencoder approach to feature extraction with linear separability for high-dimensional data[J].PeerJ Computer Science,2022,8:e1061.
[14]LI B Y,LIU Y,WANG X G.Gradient Harmonized Single-Stage Detector[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019:8577-8584.
[15]CHEN Y Q,PEDRYCZ W,YANG J.A new boundary-degree-based oversampling method for imbalanced data[J].Applied Intelligence,2023,53(22):26518-26541.
[16]LI J N,ZHU Q S,WU Q W,et al.A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors[J].Information Sciences,2021,565:438-455.
[17]WANG W T,YANG L J,ZHANG J H,et al.Natural local density-based adaptive oversampling algorithm for imbalanced classification[J].Knowledge-Based Systems,2024,295:111845.
[18]LI M,ZHOU H,LIU Q,et al.WRND:A weighted oversampling framework with relative neighborhood density for imbalanced noisy classification[J].Expert Systems With Applications,2024,241:122593.
[19]LENG Q K,GUO J M,JIAO E J,et al.NanBDOS:Adaptive and parameter-free borderline oversampling via natural neighbor search for class-imbalance learning[J].Knowledge-Based Systems,2023,274:110665.
[20]YAN Y T,JIANG Y F,ZHENG Z,et al.LDAS:Local density based adaptive sampling for imbalanced data classification[J].Expert Systems with Applications,2022,191:116213.
[21]TAO X,ZHANG X,ZHENG Y,et al.A Mean Shift-guided oversampling with self-adaptive sizes for imbalanced data classification[J].Information Sciences,2024,672:120699.
[22]ZHANG Z,TIAN H P,JIN J S.Multiple adaptive over-sampling for imbalanced data evidential classification[J].Engineering Applications of Artificial Intelligence,2024,133(F):108532.
[23]SUN L,LI M M,DING W P,et al.AFNFS:Adaptive fuzzy neighborhood-based feature selection with adaptive synthetic over-sampling for imbalanced data[J].Information Sciences,2022,612:724-744.
[24]MOUTAOUAKIL K,ROUDANI M,QUISSARI A.Optimal Entropy Genetic Fuzzy-C-Means SMOTE(OEGFCM-SMOTE)[J].Knowledge-Based Systems,2023,262:110235.
[25]MENG D X,LI Y J.An imbalanced learning method by combining SMOTE with Center Offset Factor[J].Applied Soft Computing,2022,120:108618.
[26]WANG X L,GONG J,SONG Y,et al.Adaptively weighted three-way decision oversampling:A cluster imbalanced-ratio based approach[J].Applied Intelligence,2022,53(1):312-335.
[27]XU Z Z,SHEN D R,KOU Y,et al.A Synthetic Minority Oversampling Technique Based on Gaussian Mixture Model Filtering for Imbalanced Data Classification[J].IEEE Transactions on Neural Networks and Learning Systems,2024,35(3):3740-3753.
[28]LI J N,ZHU Q S,WU Q W,et al.SMOTE-NaN-DE:Addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution[J].Knowledge-Based Systems,2021,223:107056.
[29]PARK S,LEE H,IM J.Relabeling & raking algorithm for imbalanced classification[J].Expert Systems With Applications,2024,247:123274.
[30]LIU R J.A novel synthetic minority oversampling techniquebased on relative and absolute densities for imbalanced classification[J].Applied Intelligence,2023,53(1):786-803.
[31]ZHENG Y F,WANG M N.Oversampling Method for Imbalanced Data based on Variance Transfer[J].Computer Science,2024,51(S1):657-662.