Computer Science, 2024, Vol. 51, Issue (6A): 230600199-7. doi: 10.11896/jsjkx.230600199
卓佩妍, 张瑶娜, 刘炜, 刘自金, 宋友
ZHUO Peiyan, ZHANG Yaona, LIU Wei, LIU Zijin, SONG You
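Abstract: In the financial industry, credit fraud detection is an important task that can save banks and consumer finance institutions substantial economic losses. However, credit data suffer from class imbalance and from overlapping features between positive and negative samples, which lead to low sensitivity in identifying the minority class and poor separability between classes. To address these problems, this paper proposes CTGANBoost, a method for credit fraud detection. First, in each boosting iteration of AdaBoost (Adaptive Boosting), a CTGAN (Conditional Tabular Generative Adversarial Network) constrained by class-label information is introduced to learn the feature distribution and augment the minority-class data. Second, for the augmented dataset synthesized by CTGAN, a weight normalization method is designed to ensure that the distribution characteristics and relative weights of the original dataset are preserved during sample weighting. Experimental results on three open-source datasets show that CTGANBoost outperforms other mainstream credit fraud detection methods, improving AUC by 0.5%~2.0% and F1 by 0.6%~1.8%, which verifies the effectiveness and generalization ability of CTGANBoost.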
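The abstract does not give the weight normalization formula, so the following is only a minimal sketch of one way such a step could work, not the authors' method. It assumes that the synthetic minority samples share a fixed fraction of the total weight uniformly, while the original samples share the remainder in proportion to their current boosting weights, so their relative weights are unchanged. The function name `normalize_augmented_weights` and the parameter `synthetic_share` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def normalize_augmented_weights(
    orig_weights: List[float],
    n_synthetic: int,
    synthetic_share: float = 0.2,  # hypothetical knob, not from the paper
) -> Tuple[List[float], List[float]]:
    """Return (original, synthetic) sample weights that together sum to 1.

    Assumed scheme: synthetic samples split `synthetic_share` of the total
    weight uniformly; original samples split the rest in proportion to
    their current weights, which preserves their relative ordering and ratios.
    """
    if not orig_weights or min(orig_weights) < 0 or sum(orig_weights) <= 0:
        raise ValueError("orig_weights must be non-negative with a positive sum")
    if n_synthetic < 0:
        raise ValueError("n_synthetic must be non-negative")
    if not 0.0 <= synthetic_share < 1.0:
        raise ValueError("synthetic_share must be in [0, 1)")

    # With no synthetic samples, all weight stays on the original data.
    if n_synthetic == 0:
        synthetic_share = 0.0

    total = sum(orig_weights)
    orig_mass = 1.0 - synthetic_share

    # Rescale original weights; ratios between any two of them are unchanged.
    new_orig = [orig_mass * w / total for w in orig_weights]

    # Spread the remaining mass evenly over the synthetic samples.
    new_syn = (
        [synthetic_share / n_synthetic] * n_synthetic if n_synthetic else []
    )
    return new_orig, new_syn


if __name__ == "__main__":
    orig, syn = normalize_augmented_weights([0.1, 0.3, 0.6], n_synthetic=4)
    print(orig, syn, sum(orig) + sum(syn))
```

In a boosting loop, a step like this would run after each round's augmentation and before fitting the weak learner, so that adding synthetic rows does not distort the weights AdaBoost has accumulated on the original samples.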