Computer Science ›› 2023, Vol. 50 ›› Issue (1): 253-261. doi: 10.11896/jsjkx.211100167

• Artificial Intelligence •

Deep Reinforcement Learning Based on Similarity Constrained Dual Policy Distillation

XU Ping'an1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
4 Jiangsu Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
  • Received:2021-11-16 Revised:2022-03-19 Online:2023-01-15 Published:2023-01-09
  • Corresponding author:LIU Quan(quanliu@suda.edu.cn)
  • About author:XU Ping'an,born in 1997,postgraduate(paxu@stu.suda.edu.cn).His main research interests include reinforcement learning and deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
National Natural Science Foundation of China(61772355,61702055),Major Program of Natural Science Research of Jiangsu Higher Education Institutions(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Applied Basic Research Program(Industrial Part)(SYG201422) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Policy distillation, a method of transferring knowledge from one policy to another, has achieved great success in challenging reinforcement learning tasks. The typical policy distillation approach uses a teacher-student policy model, in which knowledge is transferred from a teacher policy with excellent empirical data to a student policy. Because obtaining a teacher policy is computationally expensive, the dual policy distillation (DPD) framework was proposed; instead of depending on a teacher policy, it maintains two student policies that transfer knowledge to each other. However, if one student policy cannot surpass the other through self-learning, or if the two student policies become nearly identical after distillation, a deep reinforcement learning algorithm combined with DPD degenerates into single-policy gradient optimization. To address these problems, this paper defines the similarity between student policies and proposes the similarity constrained dual policy distillation (SCDPD) framework, which dynamically adjusts the similarity of the two student policies during knowledge transfer. Theoretical analysis shows that this effectively enhances both the exploration of the student policies and the stability of the algorithm. Experimental results show that SCDPD-SAC and SCDPD-PPO, which combine SCDPD with classical off-policy and on-policy deep reinforcement learning algorithms, achieve better performance than the classical algorithms on multiple continuous control tasks.
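
To make the framework described in the abstract concrete, the following is a minimal Python/PyTorch sketch of the dual-student update with a dynamically weighted similarity term. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure, the distillation loss, or the adjustment rule, so this sketch assumes diagonal-Gaussian student policies, uses KL divergence on shared states as the similarity, and applies a simple multiplicative weight schedule. GaussianPolicy, policy_similarity, adjust_beta and scdpd_step are hypothetical names.

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        """Diagonal-Gaussian student policy for continuous control (assumed form)."""

        def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
            )
            self.mu = nn.Linear(hidden, act_dim)
            self.log_std = nn.Parameter(torch.zeros(act_dim))

        def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
            return torch.distributions.Normal(self.mu(self.body(obs)), self.log_std.exp())

    def policy_similarity(p1: GaussianPolicy, p2: GaussianPolicy, obs: torch.Tensor) -> torch.Tensor:
        # One plausible reading of "similarity between student policies":
        # mean KL divergence between the two students on shared states
        # (smaller KL = more similar).
        kl = torch.distributions.kl_divergence(p1.dist(obs), p2.dist(obs))
        return kl.sum(-1).mean()

    def adjust_beta(beta: float, kl_value: float, target_kl: float = 0.5, rate: float = 1.05) -> float:
        # Heuristic schedule (assumed, not from the paper): weaken the pull
        # toward the peer when the students are already very similar, so they
        # do not collapse into one policy; strengthen it when they drift
        # apart, so peer distillation stays informative.
        return beta / rate if kl_value < target_kl else beta * rate

    def scdpd_step(student_a, student_b, opt_a, obs, rl_loss_a, beta):
        # One hypothetical update for student A: its base RL objective (in
        # SCDPD-SAC/SCDPD-PPO this would be the SAC or PPO loss) plus a
        # beta-weighted distillation term pulling A toward its frozen peer B.
        with torch.no_grad():
            target = student_b.dist(obs)  # peer acts as teacher; no grad into B
        kl_to_peer = torch.distributions.kl_divergence(student_a.dist(obs), target).sum(-1).mean()
        loss = rl_loss_a + beta * kl_to_peer
        opt_a.zero_grad()
        loss.backward()
        opt_a.step()
        return float(kl_to_peer.detach())

    # Example wiring (shapes typical of a MuJoCo control task; the RL loss
    # below is only a placeholder for a real SAC/PPO objective).
    a, b = GaussianPolicy(17, 6), GaussianPolicy(17, 6)
    opt_a = torch.optim.Adam(a.parameters(), lr=3e-4)
    beta, obs = 0.1, torch.randn(256, 17)
    rl_loss_a = -a.dist(obs).entropy().sum(-1).mean()  # placeholder objective
    kl = scdpd_step(a, b, opt_a, obs, rl_loss_a, beta)
    beta = adjust_beta(beta, kl)

Read this way, beta acts as the similarity constraint from the abstract: lowering it when the two students nearly coincide keeps DPD from degenerating into single-policy gradient optimization, while raising it when they drift apart keeps the peer's knowledge relevant, which is the exploration/stability trade-off the abstract attributes to SCDPD.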

Key words: Deep reinforcement learning, Policy distillation, Similarity constraint, Knowledge transfer, Continuous control tasks

• CLC Number: TP181
[1]SUTTON R S,BARTO A G.Reinforcement learning:An introduction[M].Massachusetts:MIT Press,2018.
[2]SILVER D,SCHRITTWIESER J,SIMONYAN K,et al.Mastering the game of go without human knowledge [J].Nature,2017,550(7676):354-359.
[3]KOBER J,BAGNELL J A,PETERS J.Reinforcement learning in robotics:A survey [J].The International Journal of Robotics Research,2013,32(11):1238-1274.
[4]SALLAB A E,ABDOU M,PEROT E,et al.Deep reinforcement learning framework for autonomous driving [J].Electronic Imaging,2017,2017(19):70-76.
[5]LIU Q,ZHAI J W,ZHANG Z Z,et al.A survey on deep reinforcement learning [J].Chinese Journal of Computers,2018,41(1):1-27.
[6]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning [J].Nature,2015,518(7540):529-533.
[7]MNIH V,BADIA A P,MIRZA M,et al.Asynchronous methods for deep reinforcement learning[C]//International Conference on Machine Learning.2016:1928-1937.
[8]LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[C]//ICLR.2016.
[9]FUJIMOTO S,HOOF H,MEGER D.Addressing function approximation error in actor-critic methods[C]//International Conference on Machine Learning.2018:1587-1596.
[10]SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust region policy optimization[C]//International Conference on Machine Learning.2015:1889-1897.
[11]SCHULMAN J,MORITZ P,LEVINE S,et al.High-dimensional continuous control using generalized advantage estimation[J].arXiv:1506.02438,2015.
[12]SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347,2017.
[13]HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft Actor-Critic:Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor[C]//International Conference on Machine Learning.2018:1861-1870.
[14]SALIMANS T,HO J,CHEN X,et al.Evolution strategies as a scalable alternative to reinforcement learning[J].arXiv:1703.03864,2017.
[15]TAO Y,GENC S,CHUNG J,et al.REPAINT:Knowledge Transfer in Deep Reinforcement Learning[C]//International Conference on Machine Learning.2021:10141-10152.
[16]BARRETO A,BORSA D,QUAN J,et al.Transfer in deep reinforcement learning using successor features and generalised policy improvement[C]//International Conference on Machine Learning.2018:501-510.
[17]CZARNECKI W M,PASCANU R,OSINDERO S,et al.Distilling policy distillation[C]//International Conference on Artificial Intelligence and Statistics.2019:1331-1340.
[18]LAI K H,ZHA D,LI Y,et al.Dual Policy Distillation[C]//International Joint Conference on Artificial Intelligence.2020:3146-3152.
[19]RUSU A A,COLMENAREJO S G,GULCEHRE C,et al.Policy distillation[J].arXiv:1511.06295,2015.
[20]HINTON G,VINYALS O,DEAN J.Distilling the knowledge in a neural network[J].arXiv:1503.02531,2015.
[21]WADHWANIA S,KIM D K,OMIDSHAFIEI S,et al.Policy distillation and value matching in multiagent reinforcement learning[C]//International Conference on Intelligent Robots and Systems.2019:8193-8200.
[22]CHEN G.A New Framework for Multi-Agent Reinforcement Learning-Centralized Training and Exploration with Decentralized Execution via Policy Distillation[C]//International Conference on Autonomous Agents and MultiAgent Systems.2020:1801-1803.
[23]ZHA D,LAI K H,ZHOU K,et al.Experience replay optimization[C]//International Joint Conference on Artificial Intelligence.2019:4243-4249.
[24]XU T,LIU Q,ZHAO L,et al.Learning to explore via meta-policy gradient[C]//International Conference on Machine Learning.2018:5463-5472.
[25]FANG Y,REN K,LIU W,et al.Universal Trading for Order Execution with Oracle Policy Distillation[J].arXiv:2103.10860,2021.
[26]FAN S,ZHANG X,SONG Z.Reinforced knowledge distillation:Multi-class imbalanced classifier based on policy gradient reinforcement learning [J].Neurocomputing,2021,463:422-436.
[27]HA J S,PARK Y J,CHAE H J,et al.Distilling a hierarchical policy for planning and control via representation and reinforcement learning[C]//IEEE International Conference on Robotics and Automation.2021:4459-4466.
[28]LI Z H,YU Y,CHEN Y,et al.Neural-to-Tree Policy Distillation with Policy Improvement Criterion[J].arXiv:2108.06898,2021.
[29]ZHAO C,HOSPEDALES T.Robust domain randomised reinforcement learning through peer-to-peer distillation[C]//Asian Conference on Machine Learning.2021:1237-1252.
[30]CHA H,PARK J,KIM H,et al.Proxy experience replay:Federated distillation for distributed reinforcement learning [J].IEEE Intelligent Systems,2020,35(4):94-101.
[31]SUN H,PAN X,DAI B,et al.Evolutionary Stochastic Policy Distillation[J].arXiv:2004.12909,2020.
[32]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[J].arXiv:1606.01540,2016.
Related Articles
[1] 黄昱洲, 王立松, 秦小麟. Bi-level Path Planning Method for Unmanned Vehicle Based on Deep Reinforcement Learning. Computer Science, 2023, 50(1): 194-204. https://doi.org/10.11896/jsjkx.220500241
[2] 张启阳, 陈希亮, 张巧. Sparse Reward Exploration Method Based on Trajectory Perception. Computer Science, 2023, 50(1): 262-269. https://doi.org/10.11896/jsjkx.220700010
[3] 魏楠, 魏祥麟, 范建华, 薛羽, 胡永扬. Backdoor Attack Against Deep Reinforcement Learning-based Spectrum Access Model. Computer Science, 2023, 50(1): 351-361. https://doi.org/10.11896/jsjkx.220800269
[4] 熊丽琴, 曹雷, 赖俊, 陈希亮. Overview of Multi-agent Deep Reinforcement Learning Based on Value Factorization. Computer Science, 2022, 49(9): 172-182. https://doi.org/10.11896/jsjkx.210800112
[5] 于滨, 李学华, 潘春雨, 李娜. Edge-Cloud Collaborative Resource Allocation Algorithm Based on Deep Reinforcement Learning. Computer Science, 2022, 49(7): 248-253. https://doi.org/10.11896/jsjkx.210400219
[6] 唐枫, 冯翔, 虞慧群. Multi-task Cooperative Optimization Algorithm Based on Adaptive Knowledge Transfer and Resource Allocation. Computer Science, 2022, 49(7): 254-262. https://doi.org/10.11896/jsjkx.210600184
[7] 李梦菲, 毛莺池, 屠子健, 王瑄, 徐淑芳. Server-reliability Task Offloading Strategy Based on Deep Deterministic Policy Gradient. Computer Science, 2022, 49(7): 271-279. https://doi.org/10.11896/jsjkx.210600040
[8] 谢万城, 李斌, 代玥玥. PPO Based Task Offloading Scheme in Aerial Reconfigurable Intelligent Surface-assisted Edge Computing. Computer Science, 2022, 49(6): 3-11. https://doi.org/10.11896/jsjkx.220100249
[9] 洪志理, 赖俊, 曹雷, 陈希亮, 徐志雄. Study on Intelligent Recommendation Method of Dueling Network Reinforcement Learning Based on Regret Exploration. Computer Science, 2022, 49(6): 149-157. https://doi.org/10.11896/jsjkx.210600226
[10] 李鹏, 易修文, 齐德康, 段哲文, 李天瑞. Heating Strategy Optimization Method Based on Deep Learning. Computer Science, 2022, 49(4): 263-268. https://doi.org/10.11896/jsjkx.210300155
[11] 欧阳卓, 周思源, 吕勇, 谭国平, 张悦, 项亮亮. DRL-based Vehicle Control Strategy for Signal-free Intersections. Computer Science, 2022, 49(3): 46-51. https://doi.org/10.11896/jsjkx.210700010
[12] 蔡岳, 王恩良, 孙哲, 孙知信. Study on Dual Sequence Decision-making for Trucks and Cargo Matching Based on Dual Pointer Network. Computer Science, 2022, 49(11A): 210800257-9. https://doi.org/10.11896/jsjkx.210800257
[13] 代珊珊, 刘全. Action Constrained Deep Reinforcement Learning Based Safe Automatic Driving Method. Computer Science, 2021, 48(9): 235-243. https://doi.org/10.11896/jsjkx.201000084
[14] 成昭炜, 沈航, 汪悦, 王敏, 白光伟. Deep Reinforcement Learning Based UAV Assisted SVC Video Multicast. Computer Science, 2021, 48(9): 271-277. https://doi.org/10.11896/jsjkx.201000078
[15] 周仕承, 刘京菊, 钟晓峰, 卢灿举. Intelligent Penetration Testing Path Discovery Based on Deep Reinforcement Learning. Computer Science, 2021, 48(7): 40-46. https://doi.org/10.11896/jsjkx.210400057