Computer Science (计算机科学), 2023, Vol. 50, Issue 1: 59-68. doi: 10.11896/jsjkx.220800191
陈奕君, 高浩然, 丁志军
CHEN Yijun, GAO Haoran, DING Zhijun
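Abstract: With the development of computer technology, building automated assessment models with machine learning algorithms has become an important means for financial institutions to evaluate credit. However, current credit evaluation models still have problems: credit data are class-imbalanced and high-dimensional, and changes in the external environment over time affect the behavior of credit subjects, so the data exhibit concept drift. To address this, the paper proposes a dynamic credit evaluation model that uses ensemble learning to train base classifiers on new incremental data and dynamically adjusts the weight of each base classifier to adapt to concept drift, thereby updating the model dynamically. When concept drift occurs, different forms of balancing and feature selection are applied to the high-dimensional imbalanced credit data according to the drift detection result. In particular, for feature selection, the paper proposes an incremental feature selection algorithm that incorporates representative historical samples. The algorithm performs feature selection efficiently and accurately, so the model can handle the high dimensionality, class imbalance, and concept drift of incremental credit data at the same time. Finally, real incremental high-dimensional credit datasets are used to verify that the proposed algorithm outperforms other mainstream algorithms in accuracy and efficiency.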
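The general idea of training a base classifier on each incremental chunk and re-weighting the ensemble members can be illustrated with a minimal sketch. This is not the authors' algorithm: the nearest-centroid base learner, the rule of weighting each member by its accuracy on the newest chunk, and the names `CentroidClassifier`, `DynamicWeightedEnsemble`, and `max_members` are all assumptions chosen for illustration. Drift detection, balancing, and feature selection are omitted.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


class CentroidClassifier:
    """Toy base learner (assumed): predicts the class with the nearest mean."""

    def fit(self, chunk: List[Sample]) -> "CentroidClassifier":
        self.centroids: Dict[int, List[float]] = {}
        for label in {y for _, y in chunk}:
            rows = [x for x, y in chunk if y == label]
            # Column-wise mean of all feature vectors with this label.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x: Sequence[float]) -> int:
        def sq_dist(center: List[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, center))

        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl]))


def accuracy(model: CentroidClassifier, chunk: List[Sample]) -> float:
    """Fraction of samples in the chunk that the model labels correctly."""
    return sum(model.predict(x) == y for x, y in chunk) / len(chunk)


class DynamicWeightedEnsemble:
    """Hypothetical ensemble: one member per chunk, weights track recent accuracy."""

    def __init__(self, max_members: int = 5) -> None:
        self.max_members = max_members
        self.members: List[CentroidClassifier] = []
        self.weights: List[float] = []

    def update(self, chunk: List[Sample]) -> None:
        # Train a new base classifier on the newest incremental chunk.
        self.members.append(CentroidClassifier().fit(chunk))
        # Re-weight every member by its accuracy on the newest chunk, so
        # members fitted to an outdated concept lose influence.
        self.weights = [accuracy(m, chunk) for m in self.members]
        # Keep the ensemble bounded by dropping the lowest-weight members.
        while len(self.members) > self.max_members:
            worst = self.weights.index(min(self.weights))
            del self.members[worst]
            del self.weights[worst]

    def predict(self, x: Sequence[float]) -> int:
        # Weighted vote over all current members.
        votes: Dict[int, float] = {}
        for member, weight in zip(self.members, self.weights):
            label = member.predict(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=lambda lbl: votes[lbl])
```

Calling `update` once per arriving chunk and then `predict` on new samples is the intended usage. If the label boundary flips between two chunks, the member trained on the older chunk scores poorly on the newer one and its vote is down-weighted accordingly.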