Computer Science ›› 2026, Vol. 53 ›› Issue (4): 235-244. doi: 10.11896/jsjkx.250600043

• Database & Big Data & Data Science •

• Corresponding author: YU Kui (yukui@hfut.edu.cn)
• First author: LIU Jiaqi (liujiaqi@mail.hfut.edu.cn)

Long-term Causal Effect Estimation Based on Deep Reinforcement Learning

LIU Jiaqi1,2, WANG Yujie1,2, XIANG Guodu1,2, YU Kui1,2, CAO Fuyuan3   

1 School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
    2 Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), Hefei University of Technology, Hefei 230601, China
    3 School of Computer and Information Technology (School of Big Data), Shanxi University, Taiyuan 030006, China
  • Received: 2025-06-08  Revised: 2025-11-02  Published: 2026-04-15  Online: 2026-04-08
  • About author: LIU Jiaqi, born in 2003, postgraduate. His main research interests include causal effect estimation and reinforcement learning.
    YU Kui, born in 1977, Ph.D., professor, Ph.D. supervisor, is a member of CCF (No. 14259M). His main research interest is causal inference.
  • Supported by:
    National Science and Technology Major Project of the Ministry of Science and Technology of China (2021ZD0111801) and National Natural Science Foundation of China (62376087).



Abstract: Causal effect estimation aims to quantify the magnitude of the causal effect of a treatment variable on an outcome variable. Existing mainstream causal effect estimation methods are mainly applicable to static data or to a single time point in a time series, and cannot effectively estimate the cumulative impact of the treatment variable on the outcome variable over a long time horizon. To address this problem, long-term causal effect estimation methods based on traditional reinforcement learning fit long-term potential outcomes with linear basis functions and compute the long-term causal effect from them. However, because linear basis functions have limited expressive power in complex scenarios, these methods cannot accurately identify weak causal effects, and their performance degrades markedly as the data dimensionality increases. To tackle these problems, this paper proposes a long-term causal effect estimation method based on deep reinforcement learning. The method uses a dueling network to estimate long-term potential outcomes, which effectively captures the impact of the treatment variable on the outcome variable and thereby greatly improves the algorithm's ability to identify weak causal effects; it also avoids the estimation bias that arises when an ill-suited basis function is chosen for long-term potential outcomes. Experimental results show that the proposed method outperforms existing algorithms on statistical synthetic datasets and on an order-scheduling simulation dataset.
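The dueling network mentioned in the abstract splits Q-value estimation into a state-value stream and an advantage stream, which is what lets the estimator resolve small per-action differences (and hence weak causal effects). The following is a minimal NumPy sketch of a dueling forward pass; the layer sizes, random weights, and two-action setup are illustrative assumptions, not the authors' actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class DuelingNet:
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""

    def __init__(self, state_dim, n_actions, hidden=32):
        self.W1 = rng.normal(scale=0.1, size=(state_dim, hidden))  # shared trunk
        self.Wv = rng.normal(scale=0.1, size=(hidden, 1))          # value stream V(s)
        self.Wa = rng.normal(scale=0.1, size=(hidden, n_actions))  # advantage stream A(s, a)

    def forward(self, s):
        h = relu(s @ self.W1)
        v = h @ self.Wv                      # (batch, 1)
        a = h @ self.Wa                      # (batch, n_actions)
        # Subtracting the mean advantage makes V and A identifiable:
        # the Q-values then average to V(s) across actions.
        return v + a - a.mean(axis=1, keepdims=True)

net = DuelingNet(state_dim=4, n_actions=2)
states = rng.normal(size=(8, 4))
q = net.forward(states)
print(q.shape)  # (8, 2): one long-term outcome estimate per action
```

Because the mean advantage is subtracted, averaging Q over actions recovers the value stream exactly, so the advantage stream is free to model only the (possibly small) differences between treatment choices.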

Key words: Long-term causal effect estimation, Potential outcome model, Deep reinforcement learning
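Under the potential-outcome model, the long-term causal effect described above is the difference between the cumulative outcomes a unit would accrue under sustained treatment versus sustained control. A deterministic toy rollout illustrates this; the linear carry-over dynamics and the per-step treatment shift of 0.5 are invented for the sketch and are not the paper's data-generating process.

```python
def rollout(treated, T=10, carry=0.9):
    """Accumulate the outcome over T steps under a fixed treatment policy."""
    y, total = 0.0, 0.0
    for _ in range(T):
        # The outcome carries over between steps; treatment shifts it each step.
        y = carry * y + (0.5 if treated else 0.0)
        total += y
    return total

# Long-term effect = cumulative treated outcome minus cumulative control outcome.
long_term_effect = rollout(treated=True) - rollout(treated=False)
print(round(long_term_effect, 3))
```

Note that a single-time-point estimator would only see the per-step shift of 0.5, while the cumulative effect over ten steps is roughly forty times larger here — which is why the paper targets the long-term quantity directly.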

CLC number: TP181