Computer Science, 2024, Vol. 51, Issue (6A): 230400198-6. doi: 10.11896/jsjkx.230400198

• Big Data & Data Science •

Imbalanced Data Oversampling Method Based on Variance Transfer

ZHENG Yifan, WANG Maoning   

  1. School of Information, Central University of Finance and Economics, Beijing 102206, China
  • Published: 2024-06-06
  • About author: ZHENG Yifan, born in 2000, postgraduate. Her main research interests include fraud detection and imbalanced data processing.
    WANG Maoning, born in 1987, Ph.D., professor, is a member of the CCF (No. 93508M). Her main research interests include cryptography, blockchain and digital currency.
  • Supported by:
    National Natural Science Foundation of China (61907042, 61702570), Beijing Natural Science Foundation (4194090) and Project of Research Center for Science and Technology Finance and Entrepreneurship Finance, Key Research Base of Humanities and Social Sciences, Sichuan Provincial Department of Education (JR2018-2).

Abstract: Resampling is an important approach to the imbalanced data classification problem. However, when the data set is very small, undersampling discards important information, so oversampling has become the research focus in imbalanced data classification. Although existing oversampling methods partially alleviate the imbalance between classes, they essentially introduce no additional information into the minority class, so a risk of overfitting remains. To address these problems, this paper proposes VTO, an oversampling method based on transferring the variance of the majority class. In this method, shift vectors are extracted from the majority class and adjusted using the feature weight matrix of the minority and majority classes. The shift vectors that pass the confidence conditions are then superimposed on the center of the minority class, so that the majority-class variance is introduced when new minority-class samples are generated, which enriches the minority-class feature space. To verify the effectiveness of the proposed algorithm, a decision tree is used as the classification model and trained on 6 KEEL data sets. Compared with SMOTEENN and other oversampling methods, with F-score and PR-AUC as evaluation metrics, the results show that VTO performs better on imbalanced data classification.
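The steps named in the abstract (extract shift vectors, reweight them, filter by a confidence condition, superimpose on the minority center) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the feature weight matrix or the confidence conditions, so the sketch assumes a diagonal weight equal to the ratio of minority to majority per-feature standard deviation, and assumes a candidate is accepted only if it lies closer to the minority center than to the majority center. The function name `variance_transfer_oversample` and its parameters are hypothetical.

```python
import random
from statistics import fmean, pstdev


def center(rows):
    """Per-feature mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def spread(rows):
    """Per-feature population standard deviation."""
    return [pstdev(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance_transfer_oversample(majority, minority, n_new, seed=0, max_tries=10_000):
    """Generate up to n_new synthetic minority samples (illustrative only)."""
    rng = random.Random(seed)
    c_maj, c_min = center(majority), center(minority)
    s_maj, s_min = spread(majority), spread(minority)

    # ASSUMPTION: diagonal feature weights = minority spread / majority spread.
    weights = [sm / sM if sM > 0 else 0.0 for sm, sM in zip(s_min, s_maj)]

    synthetic = []
    for _ in range(max_tries):
        if len(synthetic) >= n_new:
            break
        # Shift vector: deviation of a random majority sample from its center,
        # rescaled feature by feature.
        base = rng.choice(majority)
        shift = [w * (x - c) for w, x, c in zip(weights, base, c_maj)]
        # Superimpose the shift on the minority center.
        candidate = [c + d for c, d in zip(c_min, shift)]
        # ASSUMPTION: confidence condition = closer to minority center
        # than to majority center.
        if sq_dist(candidate, c_min) < sq_dist(candidate, c_maj):
            synthetic.append(candidate)
    return synthetic
```

Because every synthetic point is the minority center plus a rescaled majority deviation, the shape of the majority distribution is carried over to the minority class, which is the intuition the abstract describes. A faithful reproduction would need the weight matrix and confidence conditions from the full paper.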

Key words: Imbalanced data, Classification, Oversampling, Variance transfer, Covariance

CLC Number: 

  • TP311