Computer Science ›› 2019, Vol. 46 ›› Issue (6): 212-217. doi: 10.11896/j.issn.1002-137X.2019.06.032

• Artificial Intelligence •

KL-divergence-based Policy Optimization

LI Jian-guo1, ZHAO Hai-tao1, SUN Shao-yuan2   

  1. (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)1
    (School of Information Science and Technology, Donghua University, Shanghai 201620, China)2
  • Received: 2018-04-23  Published: 2019-06-24
  • Corresponding author: ZHAO Hai-tao (born 1974), male, Ph.D, professor. His main research interests include pattern recognition and artificial intelligence. E-mail: haitaozhao@ecust.edu.cn
  • About the authors: LI Jian-guo (born 1992), male, master's candidate. His main research interests include pattern recognition and reinforcement learning. E-mail: y30160642@mail.ecust.edu.cn. SUN Shao-yuan (born 1974), female, Ph.D, professor. Her main research interests include image processing and computer vision.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61375007) and the Basic Research Program of the Science and Technology Commission of Shanghai Municipality (15JC1400600).



Abstract: Reinforcement learning (RL) has broad application prospects for complex optimization and control problems. Traditional policy gradient methods cannot effectively learn complex policies in environments with high-dimensional, continuous action spaces, which leads to slow convergence or even failure to converge. To address this issue, this paper proposed an online KL-divergence-based policy optimization (KLPO) algorithm. Building on the Actor-Critic method, a penalty term constructed from the KL divergence between the "new" and "old" policies is added to the loss function to optimize the policy update of the Actor. The KL divergence is further used to control the learning step size, so that each policy update takes the largest step that stays within the region defined by the KL divergence. The proposed algorithm was tested on the classic inverted-pendulum simulation environment (Pendulum) and a public robot locomotion environment with a continuous action space (Humanoid). Simulation results show that KLPO learns complex policies better, converges faster and obtains higher returns.

Key words: Continuous action space, KL-divergence, Policy optimization, Reinforcement learning
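To make the mechanism described in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of a KL-penalized actor update for a diagonal-Gaussian policy. The surrogate objective is the advantage-weighted log-likelihood minus beta times the KL divergence between the "old" (pre-update) and "new" (current) policies, and beta is adapted from the measured KL so that each update stays within a target divergence range. PyTorch is assumed as the framework; the network layout, the names GaussianPolicy, kl_penalized_actor_loss and adapt_beta, and the constants (kl_target, the 1.5 and 2.0 factors) are illustrative assumptions, not the paper's settings.

# Minimal, illustrative sketch of a KL-penalized actor update (assuming PyTorch).
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy over a continuous action space."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())

def kl_penalized_actor_loss(policy, old_dist, obs, actions, advantages, beta):
    """Advantage-weighted log-likelihood plus a KL penalty between the
    'old' (pre-update) and 'new' (current) policies."""
    new_dist = policy.dist(obs)
    logp = new_dist.log_prob(actions).sum(-1)
    kl = kl_divergence(old_dist, new_dist).sum(-1).mean()
    loss = -(logp * advantages).mean() + beta * kl
    return loss, kl.detach()

def adapt_beta(beta, measured_kl, kl_target):
    """Step-size control: strengthen the penalty when the measured KL exceeds
    the target range, relax it when updates are overly conservative."""
    if measured_kl > 1.5 * kl_target:
        return beta * 2.0
    if measured_kl < kl_target / 1.5:
        return beta / 2.0
    return beta

A typical training loop would evaluate old_dist = policy.dist(obs) under torch.no_grad() before the update, take one or more gradient steps on the returned loss, and then call adapt_beta with the newly measured KL, so that the next update is as large as the divergence constraint allows.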

CLC Number: TP301