计算机科学 ›› 2026, Vol. 53 ›› Issue (4): 235-244. doi: 10.11896/jsjkx.250600043
柳家起1,2, 汪玉杰1,2, 相国督1,2, 俞奎1,2, 曹付元3
LIU Jiaqi1,2, WANG Yujie1,2, XIANG Guodu1,2, YU Kui1,2, CAO Fuyuan3
Abstract: Causal effect estimation aims to quantify the magnitude of a treatment variable's causal influence on an outcome variable. Existing mainstream causal effect estimation methods are designed mainly for static data or for a single time point within a time series, and therefore cannot effectively estimate the cumulative influence that a treatment exerts on an outcome over a long horizon. To address this problem, long-term causal effect estimation methods based on traditional reinforcement learning fit long-term potential outcomes with linear basis functions and compute the long-term causal effect from them. However, because linear basis functions have limited expressive power in complex scenarios, these methods cannot accurately identify weak causal effects, and their performance degrades markedly as the data dimensionality grows. To overcome these issues, this paper proposes a long-term causal effect estimation method based on deep reinforcement learning. The method estimates long-term potential outcomes with a dueling network, which effectively captures the treatment's influence on the outcome and thus substantially improves the ability to identify weak causal effects; at the same time, it avoids the estimation bias in long-term potential outcomes caused by an ill-chosen basis function. Experimental results show that the proposed method outperforms existing algorithms on synthetic statistical datasets and an order-dispatching simulation dataset.
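The dueling architecture referred to in the abstract splits the Q-function into a state-value stream and an advantage stream, combined as Q(s, a) = V(s) + A(s, a) − mean_a A(s, a), so that the network can estimate a long-term potential outcome per treatment. The following is a minimal NumPy sketch of a dueling forward pass, assuming illustrative layer sizes and the hypothetical names `DuelingQNet` and `q_values`; it is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class DuelingQNet:
    """Minimal dueling Q-network forward pass (sketch).

    Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    """

    def __init__(self, state_dim, n_actions, hidden=32):
        self.W1 = rng.normal(0, 0.1, (state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.Wv = rng.normal(0, 0.1, (hidden, 1))          # value stream V(s)
        self.Wa = rng.normal(0, 0.1, (hidden, n_actions))  # advantage stream A(s, a)

    def q_values(self, s):
        h = relu(s @ self.W1 + self.b1)  # shared trunk
        v = h @ self.Wv                  # scalar state value, shape (1,)
        a = h @ self.Wa                  # per-treatment advantages, shape (n_actions,)
        # subtracting the mean advantage makes V and A identifiable
        return v + a - a.mean()

net = DuelingQNet(state_dim=4, n_actions=2)
q = net.q_values(np.ones(4))
# long-term effect of treatment a=1 versus control a=0 at this state
effect = q[1] - q[0]
```

Under this decomposition, the long-term causal effect at a state can be read off as Q(s, 1) − Q(s, 0), and the mean-subtraction term fixes the otherwise unidentifiable split between V and A.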