Computer Science ›› 2021, Vol. 48 ›› Issue (12): 297-303. doi: 10.11896/jsjkx.201000163

• Artificial Intelligence •


Proximal Policy Optimization Based on Self-directed Action Selection

SHEN Yi1, LIU Quan1,2,3,4   

  1. School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2. Jiangsu Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
    3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
    4. Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
  • Received:2020-10-28 Revised:2021-03-11 Online:2021-12-15 Published:2021-11-26
  • Corresponding author: LIU Quan (quanliu@suda.edu.cn)
  • About author:SHEN Yi,born in 1995,postgraduate (20184227052@stu.suda.edu.cn).Her main research interests include deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,is a member of China Computer Federation.His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055,61502323,61502329),Jiangsu Province Natural Science Research University Major Projects(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Industrial Application of Basic Research Program(SYG201422) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.


Abstract: Optimization algorithms that guarantee monotonic policy improvement are a current research hotspot in reinforcement learning and have achieved good performance in both discrete and continuous control tasks. Proximal policy optimization (PPO) is a classic monotonic policy improvement algorithm, but as an on-policy method it suffers from low sample utilization. To address this problem, an algorithm named proximal policy optimization based on self-directed action selection (SDAS-PPO) is proposed. SDAS-PPO not only reuses sample experience weighted by the importance sampling ratio, but also adds a synchronously updated experience pool that stores the agent's own excellent sample experience, and uses a self-directed network learned from this pool to guide action selection. SDAS-PPO greatly improves sample utilization and ensures that the agent learns quickly and effectively while training the network model. To verify its effectiveness, SDAS-PPO is compared with the TRPO, PPO and PPO-AMBER algorithms on continuous control tasks in the MuJoCo simulation platform. Experimental results show that the proposed method performs better in most environments.
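
The sample reuse "weighted by the importance sampling ratio" refers to PPO's standard clipped surrogate objective [17]. A minimal sketch in Python/PyTorch, assuming illustrative tensor names and PPO's common default clip range of 0.2 (this is the textbook objective, not code from the paper):

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
        # Importance sampling ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s),
        # computed in log space for numerical stability.
        ratio = torch.exp(new_log_probs - old_log_probs)
        # Clipped surrogate: take the pessimistic minimum of the unclipped and
        # clipped terms, then negate so gradient descent maximizes the objective.
        surrogate = torch.min(ratio * advantages,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
        return -surrogate.mean()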

Key words: Deep reinforcement learning, Policy gradient, Proximal policy optimization, Reinforcement learning, Self-directed
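
The abstract does not specify how the self-directed experience pool and guidance network operate internally. The sketch below is one plausible reading under loud assumptions, not the authors' implementation: the pool capacity, the return threshold and every identifier here are hypothetical.

    import random
    from collections import deque

    class SelfDirectedPool:
        """Hypothetical pool holding the agent's own high-return experience."""
        def __init__(self, capacity=50000):
            # Synchronously updated FIFO buffer of (state, action) pairs.
            self.buffer = deque(maxlen=capacity)

        def add_episode(self, transitions, episode_return, threshold):
            # Keep only "excellent" episodes, i.e. those whose return beats a
            # running threshold, as the abstract's description suggests.
            if episode_return >= threshold:
                self.buffer.extend(transitions)

        def sample(self, batch_size):
            # Minibatches from this pool could train the self-directed network,
            # e.g. by regressing its output toward the stored actions.
            return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

How the guidance is applied at action-selection time, for instance by mixing the self-directed network's proposal with the PPO policy's output, cannot be determined from the abstract alone.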

CLC Number: TP181
[1]SUTTON R S,BARTO A G.Reinforcement Learning:An Introduction[M].Cambridge,MA:MIT Press,1998:6-22.
[2]PARR R,LI L,TAYLOR G,et al.An Analysis of Linear Models,Linear Value-Function Approximation,and Feature Selection for Reinforcement Learning[C]//International Conference on Machine Learning.2008.
[3]KOHL N,STONE P.Policy gradient reinforcement learning for fast quadrupedal locomotion[C]//IEEE International Conference on Robotics & Automation.IEEE,2004.
[4]BARTO A G,SUTTON R S,ANDERSON C W.Neuronlike adaptive elements that can solve difficult learning control problems[J].IEEE Transaction on Systems,Man and Cybernetics,1983,13(5):834-846.
[5]SEIJEN H V,HASSELT H V,WHITESON S,et al.A theoretical and empirical analysis of Expected Sarsa[C]//2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.IEEE,2009.
[6]KIUMARSI B,LEWIS F L,MODARES H,et al.Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics[J].Automatica,2014,50(4):1167-1175.
[7]TANGKARATT V,ABDOLMALEKI A,SUGIYAMA M. Guide Actor-Critic for Continuous Control[J].arXiv:1705.07606,2017.
[8]KRIZHEVSKY A,SUTSKEVER I,HINTON G.ImageNet Classification with Deep Convolutional Neural Networks[J].Advances in Neural Information Processing Systems,2012,25:1097-1105.
[9]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529-533.
[10]LIU Q,ZHAI J W,ZHANG Z Z,et al.A review of deep reinforcement learning[J].Chinese Journal of Computers,2018,41(1):1-27.
[11]WANG Z,SCHAUL T,HESSEL M,et al.Dueling network architectures for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning.New York,USA,2016:692-700.
[12]VAN HASSELT H,GUEZ A,SILVER D.Deep Reinforcement Learning with Double Q-Learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.Phoenix,USA,2016:2094-2100.
[13]HAUSKNECHT M,STONE P.Deep recurrent q-learning for partially observable mdps[C]//2015 AAAI fall symposium series.2015.
[14]SILVER D,LEVER G,HEESS N,et al.Deterministic policy gradient algorithms[C]//Proceedings of the 31st International Conference on Machine Learning.New York:ACM,2014:387-395.
[15]LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[J].arXiv:1509.02971,2015.
[16]SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust region policy optimization[C]//International Conference on Machine Learning.PMLR,2015:1889-1897.
[17]SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347,2017.
[18]HAN S,SUNG Y.Amber:Adaptive multi-batch experience replay for continuous action control[J].arXiv:1710.04423,2017.
[19]LIU H,FENG Y,MAO Y,et al.Sample-efficient policy optimization with stein control variate[J].arXiv:1710.11198,2017.
[20]LING P,CAI Q P,HUANG L B.Multi-Path Policy Optimization[C]//International Conference on Autonomous Agents and Multi Agent Systems.2020:1001-1009.
[21]PAN F,CAI Q,ZENG A X,et al.Policy optimization with model-based explorations[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2019,33:4675-4682.
[22]TOUATI A,ZHANG A,PINEAU J,et al.Stable policy optimization via off-policy divergence regularization[C]//Conference on Uncertainty in Artificial Intelligence.PMLR,2020:1328-1337.
[23]LI A,FLORENSA C,CLAVERA I,et al.Sub-policy Adaptation for Hierarchical Reinforcement Learning[C]//International Conference on Learning Representations.2019.
[24]YOSHIDA N,UCHIBE E,DOYA K.Reinforcement learning with state-dependent discount factor[C]//IEEE Third Joint International Conference on Development & Learning & Epigenetic Robotics.IEEE,2013.
[25]FU Q M,LIU Q,SUN H K,et al.A second-order TD Error fast Q(λ) algorithm[J].Pattern Recognition and Artificial Intelligence,2013(3):282-292.
[26]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.Openai gym[J].arXiv:1606.01540,2016.
[27]TODOROV E,EREZ T,TASSA Y.MuJoCo:A physics engine for model-based control[C]//2012 IEEE/RSJ International Conference on Intelligent Robots and Systems(IROS).IEEE,2012.
[28]DHARIWAL P,HESSE C,KLIMOV O,et al.OpenAI Baselines[OL].GitHub,2017.https://github.com/openai/baselines.