Computer Science ›› 2019, Vol. 46 ›› Issue (6): 212-217.doi: 10.11896/j.issn.1002-137X.2019.06.032


KL-divergence-based Policy Optimization

LI Jian-guo1, ZHAO Hai-tao1, SUN Shao-yuan2   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  2. School of Information Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2018-04-23  Published: 2019-06-24

Abstract: Reinforcement learning has broad application prospects for complex optimization and control problems. However, in environments with high-dimensional, continuous action spaces, traditional policy gradient methods cannot learn complex policies effectively, which leads to slow convergence or even non-convergence. To address this issue, this paper proposed an online KL-divergence-based policy optimization (KLPO) algorithm. Building on the Actor-Critic algorithm, a KL-divergence penalty is introduced that adds the distance between the new and old policies to the policy loss function, regularizing the policy updates of the Actor. Furthermore, the learning step size is controlled by the KL-divergence, ensuring that each policy update takes the maximum step that stays within a safe region. Simulation results on the Pendulum and Humanoid tasks show that the KLPO algorithm can learn complex policies better, converge faster, and obtain higher returns.
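
To make the penalized objective concrete, the sketch below shows one way a KL-divergence penalty between the old and new policies can be added to an Actor's loss in PyTorch. This is a minimal illustration, not the authors' implementation: it assumes a diagonal Gaussian policy, and the names (kl_penalized_actor_loss, beta) as well as the fixed penalty coefficient beta, which stands in for the paper's adaptive step-size control, are hypothetical.

    import torch
    from torch.distributions import Normal, kl_divergence

    def kl_penalized_actor_loss(mu_new, std_new, mu_old, std_old,
                                actions, advantages, beta=1.0):
        # Gaussian policies before ("old") and after ("new") the update;
        # the old policy is detached so the penalty only moves the new one.
        new_dist = Normal(mu_new, std_new)
        old_dist = Normal(mu_old.detach(), std_old.detach())
        # Policy-gradient surrogate: maximize E[log pi(a|s) * A(s,a)],
        # i.e. minimize its negative.
        pg_loss = -(new_dist.log_prob(actions).sum(-1) * advantages).mean()
        # KL(old || new) grows with the distance between the two policies,
        # discouraging updates that leave the "safe" region around the old one.
        kl_penalty = kl_divergence(old_dist, new_dist).sum(-1).mean()
        return pg_loss + beta * kl_penalty

    # Example call: a batch of 32 one-dimensional actions (shapes assumed).
    mu_new = torch.zeros(32, 1, requires_grad=True)
    loss = kl_penalized_actor_loss(mu_new, torch.ones(32, 1),
                                   torch.zeros(32, 1), torch.ones(32, 1),
                                   torch.randn(32, 1), torch.randn(32))
    loss.backward()

Related methods either tune such a coefficient adaptively or replace the penalty with a hard KL constraint (as in trust region policy optimization); the abstract above describes the penalty form combined with a KL-controlled step size.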

Key words: Continuous action space, KL-divergence, Policy optimization, Reinforcement learning

CLC Number: TP301