Computer Science ›› 2023, Vol. 50 ›› Issue (1): 59-68. doi: 10.11896/jsjkx.220800191

• Database & Big Data & Data Science •

Credit Evaluation Model Based on Dynamic Machine Learning

CHEN Yijun, GAO Haoran, DING Zhijun

  1. Key Laboratory of Embedded System and Service Computing of Ministry of Education, Tongji University, Shanghai 201804, China
    Shanghai Network Finance Security Collaborative Innovation Center, Tongji University, Shanghai 201804, China
  • Received: 2022-08-19 Revised: 2022-09-21 Online: 2023-01-15 Published: 2023-01-09
  • Corresponding author: DING Zhijun (dingzj@tongji.edu.cn)
  • About author: 2232918@tongji.edu.cn
    CHEN Yijun, born in 2000, postgraduate. Her main research interests include data mining and machine learning.
    DING Zhijun, born in 1974, Ph.D., professor, Ph.D. supervisor, is a senior member of China Computer Federation. His main research interests include intelligent software engineering, cloud computing and services, big data credit reporting and financial risk control.
  • Supported by:
    Shanghai Science and Technology Innovation Action Plan (19511101300).

Abstract: With the development of computer technology, building automated evaluation models with machine learning algorithms has become an important means for financial institutions to conduct credit evaluation. However, current credit evaluation models still face several problems: credit data is class-imbalanced and high-dimensional, and changes in the external environment over time affect the behavior of credit subjects, that is, the data exhibits concept drift. To address this, this paper proposes a dynamic credit evaluation model that uses ensemble learning to train base classifiers on new incremental data and dynamically adjusts the weight of each base classifier to adapt to concept drift, thereby updating the model dynamically. When concept drift occurs, the model applies different forms of equalization and feature selection to the high-dimensional imbalanced credit data according to the drift detection results. In particular, for feature selection, this paper proposes an incremental feature selection algorithm that incorporates historical representative samples and performs feature selection efficiently and accurately, so that the model can handle both the high-dimensional imbalance and the concept drift of incremental credit data. Finally, an evaluation on real incremental high-dimensional credit datasets verifies that the proposed algorithm outperforms other mainstream algorithms in accuracy and efficiency.
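The general idea of a chunk-based ensemble with dynamically adjusted weights can be illustrated with a minimal sketch. This is not the paper's algorithm: the nearest-centroid base learner, the rule of weighting each member by its accuracy on the newest chunk, the pruning rule, and all names (`CentroidClassifier`, `DynamicEnsemble`, `max_members`) are assumptions chosen only for illustration, and drift detection, equalization, and feature selection are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


class CentroidClassifier:
    """Toy base learner (assumed): predicts the class with the nearest mean vector."""

    def fit(self, chunk: List[Sample]) -> "CentroidClassifier":
        self.centroids: Dict[int, List[float]] = {}
        for label in {y for _, y in chunk}:
            rows = [x for x, y in chunk if y == label]
            # Column-wise mean of all rows belonging to this class.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x: Sequence[float]) -> int:
        def sq_dist(center: Sequence[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, center))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def accuracy(model: CentroidClassifier, chunk: List[Sample]) -> float:
    return sum(model.predict(x) == y for x, y in chunk) / len(chunk)


@dataclass
class DynamicEnsemble:
    max_members: int = 5  # hypothetical cap on ensemble size
    members: List[CentroidClassifier] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)

    def update(self, chunk: List[Sample]) -> None:
        # Train a new base classifier on the newest incremental chunk.
        self.members.append(CentroidClassifier().fit(chunk))
        # Re-weight every member by its accuracy on the newest chunk, so
        # members fitted to an outdated concept lose influence.
        self.weights = [accuracy(m, chunk) for m in self.members]
        # Drop the lowest-weighted members if the ensemble grows too large.
        while len(self.members) > self.max_members:
            worst = self.weights.index(min(self.weights))
            del self.members[worst]
            del self.weights[worst]

    def predict(self, x: Sequence[float]) -> int:
        # Weighted vote over all current members.
        votes: Dict[int, float] = {}
        for model, weight in zip(self.members, self.weights):
            label = model.predict(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)
```

In this sketch, calling `update` once per arriving data chunk keeps the ensemble current: a member trained before a drift scores poorly on the newest chunk and its vote is discounted accordingly.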

Key words: Credit evaluation, Feature selection, Concept drift, Sliding window, Dynamic model

CLC Number:

  • TP3-05