Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230600199-7. doi: 10.11896/jsjkx.230600199

• Big Data & Data Science •

CTGANBoost: Credit Fraud Detection Based on CTGAN and Boosting

ZHUO Peiyan, ZHANG Yaona, LIU Wei, LIU Zijin, SONG You

  1. School of Software, Beihang University, Beijing 100191, China
  • Published: 2024-06-06
  • Corresponding author: SONG You (songyou@buaa.edu.cn)
  • About the authors: ZHUO Peiyan (zhuopy@buaa.edu.cn), born in 1999, postgraduate. Her main research interests include data mining and financial technology.
    SONG You, born in 1973, professor, Ph.D. supervisor. His main research interests include data analysis techniques, financial technology, information processing, and knowledge graphs.
  • Supported by:
    Key Research and Development Program of Hebei Province, China (21310101D).

Abstract: In the financial industry, credit fraud detection is an important task that can spare banks and consumer finance institutions substantial economic losses. However, credit data suffer from class imbalance and from overlapping features between positive and negative samples, which lead to low sensitivity in recognizing the minority class and poor separability between classes. To address these problems, a CTGANBoost method is proposed for credit fraud detection. First, in each Boosting iteration of AdaBoost (Adaptive Boosting), a CTGAN (Conditional Tabular Generative Adversarial Network) constrained by class label information is introduced to learn the feature distribution and augment the minority class. Second, for the augmented dataset synthesized by CTGAN, a weight normalization method is designed to ensure that the distribution characteristics and relative weights of the original dataset are preserved during sample weighting. Experimental results on three open-source datasets show that CTGANBoost outperforms other mainstream credit fraud detection methods, with AUC increased by 0.5%~2.0% and F1 increased by 0.6%~1.8%, which verifies the effectiveness and generalization ability of the CTGANBoost method.

Key words: Credit fraud, Imbalanced data, Ensemble learning, Generative adversarial network, AdaBoost
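
The abstract does not state the formula used for weight normalization, so the following is only a minimal sketch of how one boosting round could merge synthetic minority samples with the original weighted samples. It assumes that the original weights are rescaled by a single common factor (so their ratios stay unchanged) and that the synthetic rows share a uniform weight proportional to their count. The names `normalize_augmented_weights` and `augment_round`, the `synthetic_share` parameter, and the `synthesize` callable (a stand-in for sampling from a CTGAN conditioned on the minority label) are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = List[float]


def normalize_augmented_weights(
    orig_weights: Sequence[float],
    n_synthetic: int,
    synthetic_share: Optional[float] = None,
) -> Tuple[List[float], List[float]]:
    """Return (rescaled original weights, synthetic weights) that sum to 1 together.

    Assumption: original weights are multiplied by one common factor, which
    keeps their relative ratios; synthetic rows get equal weights.
    """
    total = sum(orig_weights)
    if len(orig_weights) == 0 or total <= 0:
        raise ValueError("original weights must be non-empty with a positive sum")
    if n_synthetic == 0:
        return [w / total for w in orig_weights], []
    if synthetic_share is None:
        # Assumed default: weight mass proportional to the number of rows.
        synthetic_share = n_synthetic / (len(orig_weights) + n_synthetic)
    if not 0.0 <= synthetic_share < 1.0:
        raise ValueError("synthetic_share must be in [0, 1)")
    scale = (1.0 - synthetic_share) / total
    scaled = [w * scale for w in orig_weights]
    synthetic = [synthetic_share / n_synthetic] * n_synthetic
    return scaled, synthetic


def augment_round(
    X: Sequence[Row],
    y: Sequence[int],
    weights: Sequence[float],
    synthesize: Callable[[int], List[Row]],
    minority_label: int = 1,
) -> Tuple[List[Row], List[int], List[float]]:
    """Build the augmented, re-weighted training set for one boosting round.

    `synthesize(n)` is a placeholder for drawing n minority-class rows from a
    generator such as CTGAN; here it is any callable returning n rows.
    """
    n_minority = sum(1 for label in y if label == minority_label)
    # Binary case: add enough minority rows to match the majority count.
    n_needed = max(len(y) - 2 * n_minority, 0)
    synthetic_rows = synthesize(n_needed)
    w_orig, w_syn = normalize_augmented_weights(weights, len(synthetic_rows))
    X_aug = list(X) + list(synthetic_rows)
    y_aug = list(y) + [minority_label] * len(synthetic_rows)
    return X_aug, y_aug, w_orig + w_syn


if __name__ == "__main__":
    # Toy data: three majority rows, one minority row, uniform weights.
    X_demo = [[0.0], [1.0], [2.0], [3.0]]
    y_demo = [0, 0, 0, 1]
    w_demo = [0.25] * 4
    X_aug, y_aug, w_aug = augment_round(
        X_demo, y_demo, w_demo, lambda n: [[9.0]] * n
    )
    print(len(X_aug), y_aug, w_aug)
```

In a full boosting loop, the augmented set and weights would be passed to the weak learner for that round, and only the weights of the original rows would be carried forward to the next round. That carry-forward rule is likewise an assumption made here for illustration.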

CLC Number:

  • TP391