Computer Science (计算机科学), 2023, Vol. 50, Issue (1): 253-261. doi: 10.11896/jsjkx.211100167
XU Ping'an1, LIU Quan1,2,3,4
Abstract: Policy distillation, a method for transferring knowledge from one policy to another, has achieved great success on challenging reinforcement learning tasks. Typical policy distillation methods adopt a teacher-student model, in which knowledge is transferred from a teacher policy with high-quality experience data to a student policy. Since obtaining a teacher policy consumes substantial computational resources, the Dual Policy Distillation (DPD) framework was proposed: instead of relying on a teacher policy, it maintains two student policies that transfer knowledge to each other. However, if one student policy cannot surpass the other through self-learning, or if the two student policies converge to the same policy after distillation, a deep reinforcement learning algorithm combined with DPD degenerates into gradient optimization of a single policy. To address these problems, this paper defines a notion of similarity between student policies and proposes the Similarity Constrained Dual Policy Distillation (SCDPD) framework. During knowledge transfer, SCDPD dynamically adjusts the similarity between the two student policies, and it is shown theoretically that this effectively improves the exploration of the student policies and the stability of the algorithm. Experimental results show that SCDPD-SAC and SCDPD-PPO, which combine SCDPD with classical off-policy and on-policy deep reinforcement learning algorithms respectively, outperform the classical algorithms on multiple continuous control tasks.
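The core mechanism sketched in the abstract — measuring how similar the two student policies are and scaling the mutual distillation term so they do not collapse into a single policy — can be illustrated with a minimal example. This is an illustrative sketch, not the paper's actual algorithm: the function names (`similarity`, `distill_weight`), the use of symmetric KL divergence as the similarity measure, and the `target_similarity` threshold are all assumptions made for clarity.

```python
import math

def softmax(logits):
    """Convert raw action preferences into a discrete action distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete action distributions over the same action set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def similarity(p, q):
    """Symmetric similarity in (0, 1]: 1.0 means the two policies are identical."""
    sym_kl = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return math.exp(-sym_kl)

def distill_weight(p, q, target_similarity=0.5, base_weight=1.0):
    """Weight on the peer-distillation loss term (illustrative constraint).

    Once the two students are already more similar than the target,
    the weight decays toward zero, discouraging the collapse into a
    single policy that the abstract identifies as DPD's failure mode.
    """
    sim = similarity(p, q)
    if sim >= target_similarity:
        return base_weight * (1.0 - sim) / (1.0 - target_similarity)
    return base_weight

# Two hypothetical student policies over the same 3 actions:
student_a = softmax([2.0, 0.0, 0.0])
student_b = softmax([0.0, 0.0, 2.0])
w = distill_weight(student_a, student_b)  # full weight: policies still diverse
```

In an actual SCDPD-style update, each student's loss would add `w` times a distillation term toward the other student's distribution; as the policies converge, `w` shrinks, preserving diversity between the two students.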