Computer Science, 2020, Vol. 47, Issue (11A): 454-458. doi: 10.11896/jsjkx.200600002

• Big Data & Data Science •

Study on XGBoost Improved Method Based on Genetic Algorithm and Random Forest

WANG Xiao-hui1, ZHANG Liang1, LI Jun-qing1,2, SUN Yu-cui1, TIAN Jie1, HAN Rui-yi1
(王晓晖, 张亮, 李俊清, 孙玉翠, 田捷, 韩睿毅)

  1. 1 School of Information Science and Engineering, Shandong Agricultural University, Taian, Shandong 271018, China
    2 Agricultural Big Data Research Center, Shandong Agricultural University, Taian, Shandong 271018, China
  • Online: 2020-11-15 Published: 2020-11-17
  • Corresponding author: LI Jun-qing (a397858801@126.com)
  • Author contact: wangxh1998@163.com
  • About author: WANG Xiao-hui, born in 1998, undergraduate. His main research interests include machine learning.
    LI Jun-qing, born in 1984, postgraduate, associate professor. His main research interests include artificial intelligence and big data.
  • Supported by:
    This work was supported by the project "Joint Flood Control Operation of Reservoir Groups in River Basin Driven by Big Data" (2019GSF111043).

Abstract: Regression prediction is an important research direction in machine learning and has a wide range of applications. To further improve the accuracy of regression prediction, an improved XGBoost method based on the genetic algorithm and random forest (GA_XGBoost_RF) is proposed. First, exploiting the good search ability and flexibility of the Genetic Algorithm (GA), the parameters of the XGBoost algorithm and the Random Forest (RF) algorithm are tuned with the average cross-validation score as the objective function value, and the better parameter sets are selected to build the GA_XGBoost and GA_RF models, respectively. Then GA_XGBoost and GA_RF are combined with variable weights: the mean square error between the predicted and true values on the training set is used as the objective function, and the genetic algorithm determines the model weights. Experiments on UCI data sets show that, compared with XGBoost, Random Forest, GA_XGBoost and GA_RF, the GA_XGBoost_RF method achieves better mean square error, absolute error and goodness of fit than the single models on most of the data sets. In terms of goodness of fit, the proposed method improves by about 0.01%~2.1% across the different data sets, which indicates that it is an effective regression prediction method.

Key words: Combination prediction, Genetic algorithm, Random forest, Regression prediction, XGBoost
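
The weight-determination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the GA operators or the exact weighting scheme, so the sketch assumes a single global weight w (the combined prediction is w * pred_a + (1 - w) * pred_b), truncation selection, arithmetic crossover and Gaussian mutation. The function names (mse, combine, ga_weight) and the prediction values are hypothetical placeholders standing in for the training-set outputs of GA_XGBoost and GA_RF; the earlier hyperparameter-tuning step is not shown.

```python
import random


def mse(y_true, y_pred):
    """Mean square error between two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def combine(w, pred_a, pred_b):
    """Weighted combination of two models' predictions (assumed form)."""
    return [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]


def ga_weight(y_true, pred_a, pred_b, pop_size=30, generations=60,
              mutation_rate=0.2, seed=0):
    """Search for w in [0, 1] that minimises the training-set MSE."""
    rng = random.Random(seed)

    def fitness(w):
        return mse(y_true, combine(w, pred_a, pred_b))

    # Initial population: random candidate weights in [0, 1].
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half unchanged (elitism).
        population.sort(key=fitness)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            alpha = rng.random()
            child = alpha * p1 + (1.0 - alpha) * p2  # arithmetic crossover
            if rng.random() < mutation_rate:
                child += rng.gauss(0.0, 0.1)  # Gaussian mutation
            children.append(min(1.0, max(0.0, child)))  # clip to [0, 1]
        population = elite + children
    return min(population, key=fitness)


# Hypothetical training-set targets and model predictions (made-up numbers).
y_train = [3.0, 5.0, 7.0, 9.0]
pred_ga_xgboost = [3.4, 4.8, 7.3, 8.9]
pred_ga_rf = [2.4, 5.2, 6.5, 9.3]

w = ga_weight(y_train, pred_ga_xgboost, pred_ga_rf)
print("weight for GA_XGBoost:", round(w, 3))
print("combined MSE:", mse(y_train, combine(w, pred_ga_xgboost, pred_ga_rf)))
print("single-model MSEs:", mse(y_train, pred_ga_xgboost), mse(y_train, pred_ga_rf))
```

With only one weight, the minimum could also be found in closed form; the GA is used here only to mirror the procedure named in the abstract, and it would extend to more models or per-model weight vectors by widening the chromosome.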

CLC Number:

  • TP181