Computer Science ›› 2023, Vol. 50 ›› Issue (10): 48-58. doi: 10.11896/jsjkx.230600022

• Granular Computing & Knowledge Discovery •


Imbalanced Undersampling Based on Constructive Neural Network and Global Density Information

YAN Yuanting, MA Yingao, REN Yanping, ZHANG Yanping   

  1. College of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Received: 2023-06-02  Revised: 2023-08-08  Online: 2023-10-10  Published: 2023-10-10
  • About author: YAN Yuanting, born in 1986, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include data mining, machine learning and granular computing.
  • Supported by:
    National Natural Science Foundation of China (61806002).

Abstract: Undersampling is one of the mainstream data-level techniques for dealing with imbalanced data. In recent years, researchers have proposed numerous undersampling methods, but most of them focus on how to select representative majority-class samples so as to reduce the loss of informative data. However, how to maintain the structure of the original majority class during undersampling is still an open challenge. To this end, an undersampling method for imbalanced data classification is proposed based on a constructive neural network and global distribution density. First, it detects the local patterns of the majority class with a simplified constructive process. Then, two structure-preserving sample selection strategies are designed according to the original majority-class distribution information. Finally, to address the problem that the randomness of local pattern learning may lead to non-optimal sampling results, a bagging ensemble strategy is introduced to further improve learning performance. Comparative experiments with 13 comparison methods on 59 datasets verify the effectiveness of the proposed method in terms of three common metrics: G-mean, AUC, and F1-score.
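The pipeline sketched in the abstract (partition the majority class into local groups, then select samples from each group in proportion to its share of the whole class so that the original structure is preserved) can be illustrated as follows. This is a minimal sketch, not the paper's algorithm: the greedy spherical covering below is a hypothetical stand-in for the constructive-network local-pattern learning, and the proportional quota is just one simple structure-preserving selection rule. The names `cover_majority`, `density_undersample`, and the `radius` parameter are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def cover_majority(points, radius):
    """Greedily cover the majority samples with spheres of a fixed radius.

    Each cover (center index, member indices) plays the role of one
    "local pattern" of the majority class; this is a hypothetical
    simplification of the constructive-network covering process.
    """
    covers = []
    unassigned = list(range(len(points)))
    while unassigned:
        seed = unassigned[0]
        members = [i for i in unassigned
                   if math.dist(points[i], points[seed]) <= radius]
        covers.append((seed, members))
        unassigned = [i for i in unassigned if i not in members]
    return covers


def density_undersample(points, n_keep, radius, rng):
    """Keep about n_keep majority points, giving each cover a quota
    proportional to its size so that no local group is wiped out."""
    covers = cover_majority(points, radius)
    total = len(points)
    kept = []
    for _, members in covers:
        quota = max(1, round(n_keep * len(members) / total))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept[:n_keep] if len(kept) > n_keep else kept
```

In the paper the randomness of the covering step is further handled by repeating the sampling and bagging the resulting classifiers; the sketch above shows only a single draw.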

Key words: Undersampling, Imbalanced data, Distribution density, Constructive neural network, Ensemble learning

CLC Number: TP311