Computer Science ›› 2023, Vol. 50 ›› Issue (1): 253-261.doi: 10.11896/jsjkx.211100167

• Artificial Intelligence •

Deep Reinforcement Learning Based on Similarity Constrained Dual Policy Distillation

XU Ping'an1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
    4 Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
  • Received:2021-11-16 Revised:2022-03-19 Online:2023-01-15 Published:2023-01-09
  • About author:XU Ping'an,born in 1997,postgraduate.His main research interests include reinforcement learning and deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055),Jiangsu Province Natural Science Research University Major Projects(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Industrial Application of Basic Research Program Part(SYG201422) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Policy distillation, a method for transferring knowledge from one policy to another, has achieved great success on challenging reinforcement learning tasks. The typical policy distillation approach uses a teacher-student model, in which knowledge is transferred from a well-performing teacher policy to a student policy. Since obtaining a teacher policy is computationally expensive, the dual policy distillation (DPD) framework was proposed; it maintains two student policies that transfer knowledge to each other and no longer depends on a teacher policy. However, if one student policy cannot surpass the other through self-learning, or if the two student policies converge to the same behavior after distillation, a deep reinforcement learning algorithm combined with DPD degenerates into single-policy gradient optimization. To address these problems, this paper defines the similarity between student policies and proposes the similarity constrained dual policy distillation (SCDPD) framework. SCDPD dynamically adjusts the similarity between the two student policies during knowledge transfer, and is shown theoretically to enhance both the exploration of the student policies and the stability of the algorithm. Experimental results show that the SCDPD-SAC and SCDPD-PPO algorithms, which combine SCDPD with classical off-policy and on-policy deep reinforcement learning algorithms respectively, outperform the classical algorithms on multiple continuous control tasks.

Key words: Deep reinforcement learning, Policy distillation, Similarity constraint, Knowledge transfer, Continuous control tasks
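The SCDPD objective itself is not reproduced on this page. As an illustration only, the sketch below shows one hypothetical way a similarity measure between two discrete student policies could modulate a mutual distillation term: when the policies are nearly identical, the distillation weight is held down so they keep exploring independently, and when they diverge it grows to strengthen knowledge transfer. All function names, the KL-based similarity, and the clipping rule are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete action distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def distill_weight(pi_a, pi_b, low=0.1, high=2.0):
    """Toy similarity-constrained distillation weight:
    similar policies (small KL) get a small weight so they
    continue to explore apart; dissimilar policies get a
    larger weight to strengthen mutual knowledge transfer."""
    sim = kl_divergence(pi_a, pi_b)
    return float(np.clip(sim, low, high))

# Two student policies over three actions
pi_a = np.array([0.5, 0.3, 0.2])
pi_b = np.array([0.4, 0.4, 0.2])

w = distill_weight(pi_a, pi_b)
loss_distill = w * kl_divergence(pi_a, pi_b)
```

In a full algorithm this scalar would scale the distillation term added to each student's own policy-gradient loss (e.g. SAC or PPO), which is the degenerate case SCDPD is designed to avoid when the weight would otherwise stay fixed.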

CLC Number: TP181