Computer Science, 2021, Vol. 48, Issue (12): 297-303. doi: 10.11896/jsjkx.201000163
SHEN Yi1, LIU Quan1,2,3,4
Abstract: Optimization algorithms with monotonic policy improvement are a current research focus in reinforcement learning and perform well on both discrete and continuous control tasks. Proximal Policy Optimization (PPO) is a classic algorithm of this kind, but as an on-policy algorithm its sample efficiency is low. To address this problem, this paper proposes Proximal Policy Optimization based on Self-Directed Action Selection (SDAS-PPO). SDAS-PPO not only exploits sample experience through importance-sampling weights, but also adds a synchronously updated experience pool that stores the agent's own high-quality experiences, and uses a self-guidance network learned from this pool to guide action selection. SDAS-PPO greatly improves sample efficiency and ensures that the agent learns quickly and effectively while training the network model. To verify the effectiveness of SDAS-PPO, it is compared with TRPO, PPO, and PPO-AMBER on continuous control tasks in the MuJoCo simulation platform. Experimental results show that the proposed method performs better in the vast majority of environments.
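The abstract hinges on two mechanisms: PPO's clipped importance-sampling objective and an auxiliary pool of the agent's own high-return experience. As a rough illustration only (this is not the paper's implementation; the function and class names below are hypothetical), a minimal sketch:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO (Schulman et al., 2017).

    ratio: importance-sampling weight pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage A(s, a)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum is a pessimistic lower bound on the
    # surrogate, which discourages destructively large policy updates.
    return np.minimum(unclipped, clipped)

class GoodExperiencePool:
    """Hypothetical sketch of a pool that keeps only the agent's
    highest-return trajectories; a self-guidance network could then
    be trained on its contents to steer action selection."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.trajectories = []  # list of (return, trajectory) pairs

    def add(self, ret, trajectory):
        self.trajectories.append((ret, trajectory))
        # Keep only the best `capacity` trajectories by return.
        self.trajectories.sort(key=lambda pair: pair[0], reverse=True)
        del self.trajectories[self.capacity:]

    def best_return(self):
        return self.trajectories[0][0] if self.trajectories else None
```

Taking the minimum of the clipped and unclipped terms is what makes PPO's update conservative: when the ratio drifts outside `[1 - eps, 1 + eps]` in the direction that would inflate the objective, the gradient through that sample vanishes.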
[1] SUTTON R S, BARTO A G. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 1998: 6-22.
[2] PARR R, LI L, TAYLOR G, et al. An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning [C]// International Conference on Machine Learning. 2008.
[3] KOHL N, STONE P. Policy gradient reinforcement learning for fast quadrupedal locomotion [C]// IEEE International Conference on Robotics & Automation. IEEE, 2004.
[4] BARTO A G, SUTTON R S, ANDERSON C W. Neuronlike adaptive elements that can solve difficult learning control problems [J]. IEEE Transactions on Systems, Man, and Cybernetics, 1983, 13(5): 834-846.
[5] SEIJEN H V, HASSELT H V, WHITESON S, et al. A theoretical and empirical analysis of Expected Sarsa [C]// Adaptive Dynamic Programming and Reinforcement Learning. IEEE, 2009.
[6] KIUMARSI B, LEWIS F L, MODARES H, et al. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics [J]. Automatica, 2014, 50(4): 1167-1175.
[7] TANGKARATT V, ABDOLMALEKI A, SUGIYAMA M. Guide Actor-Critic for Continuous Control [J]. arXiv:1705.07606, 2017.
[8] KRIZHEVSKY A, SUTSKEVER I, HINTON G. ImageNet Classification with Deep Convolutional Neural Networks [J]. Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
[9] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning [J]. Nature, 2015, 518(7540): 529-533.
[10] LIU Q, ZHAI J W, ZHANG Z Z, et al. A review of deep reinforcement learning [J]. Chinese Journal of Computers, 2018, 41(1): 1-27.
[11] WANG Z, SCHAUL T, HESSEL M, et al. Dueling network architectures for deep reinforcement learning [C]// Proceedings of the 33rd International Conference on Machine Learning. New York, USA, 2016: 692-700.
[12] VAN HASSELT H, GUEZ A, SILVER D. Deep Reinforcement Learning with Double Q-Learning [C]// Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Phoenix, USA, 2016: 2094-2100.
[13] HAUSKNECHT M, STONE P. Deep recurrent Q-learning for partially observable MDPs [C]// 2015 AAAI Fall Symposium Series. 2015.
[14] SILVER D, LEVER G, HEESS N, et al. Deterministic policy gradient algorithms [C]// Proceedings of the 31st International Conference on Machine Learning. New York: ACM, 2014: 387-395.
[15] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning [J]. arXiv:1509.02971, 2015.
[16] SCHULMAN J, LEVINE S, ABBEEL P, et al. Trust region policy optimization [C]// International Conference on Machine Learning. PMLR, 2015: 1889-1897.
[17] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms [J]. arXiv:1707.06347, 2017.
[18] HAN S, SUNG Y. AMBER: Adaptive multi-batch experience replay for continuous action control [J]. arXiv:1710.04423, 2017.
[19] LIU H, FENG Y, MAO Y, et al. Sample-efficient policy optimization with Stein control variate [J]. arXiv:1710.11198, 2017.
[20] LING P, CAI Q P, HUANG L B. Multi-Path Policy Optimization [C]// International Conference on Autonomous Agents and Multiagent Systems. 2020: 1001-1009.
[21] PAN F, CAI Q, ZENG A X, et al. Policy optimization with model-based explorations [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 4675-4682.
[22] TOUATI A, ZHANG A, PINEAU J, et al. Stable policy optimization via off-policy divergence regularization [C]// Conference on Uncertainty in Artificial Intelligence. PMLR, 2020: 1328-1337.
[23] LI A, FLORENSA C, CLAVERA I, et al. Sub-policy Adaptation for Hierarchical Reinforcement Learning [C]// International Conference on Learning Representations. 2019.
[24] YOSHIDA N, UCHIBE E, DOYA K. Reinforcement learning with state-dependent discount factor [C]// IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics. IEEE, 2013.
[25] FU Q M, LIU Q, SUN H K, et al. A second-order TD error fast Q(λ) algorithm [J]. Pattern Recognition and Artificial Intelligence, 2013(3): 282-292.
[26] BROCKMAN G, CHEUNG V, PETTERSSON L, et al. OpenAI Gym [J]. arXiv:1606.01540, 2016.
[27] TODOROV E, EREZ T, TASSA Y. MuJoCo: A physics engine for model-based control [C]// 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012.
[28] DHARIWAL P, HESSE N, MANNING C, et al. OpenAI Baselines [OL]. GitHub, 2017. https://github.com/openai/baselines.
Related articles:
[1] 熊丽琴, 曹雷, 赖俊, 陈希亮. Overview of Multi-agent Deep Reinforcement Learning Based on Value Factorization. Computer Science, 2022, 49(9): 172-182. https://doi.org/10.11896/jsjkx.210800112
[2] 刘兴光, 周力, 刘琰, 张晓瀛, 谭翔, 魏急波. Construction and Distribution Method of REM Based on Edge Intelligence. Computer Science, 2022, 49(9): 236-241. https://doi.org/10.11896/jsjkx.220400148
[3] 袁唯淋, 罗俊仁, 陆丽娜, 陈佳星, 张万鹏, 陈璟. Methods in Adversarial Intelligent Game: A Holistic Comparative Analysis from Perspective of Game Theory and Reinforcement Learning. Computer Science, 2022, 49(8): 191-204. https://doi.org/10.11896/jsjkx.220200174
[4] 史殿习, 赵琛然, 张耀文, 杨绍武, 张拥军. Adaptive Reward Method for End-to-End Cooperation Based on Multi-agent Reinforcement Learning. Computer Science, 2022, 49(8): 247-256. https://doi.org/10.11896/jsjkx.210700100
[5] 于滨, 李学华, 潘春雨, 李娜. Edge-Cloud Collaborative Resource Allocation Algorithm Based on Deep Reinforcement Learning. Computer Science, 2022, 49(7): 248-253. https://doi.org/10.11896/jsjkx.210400219
[6] 李梦菲, 毛莺池, 屠子健, 王瑄, 徐淑芳. Server-reliability Task Offloading Strategy Based on Deep Deterministic Policy Gradient. Computer Science, 2022, 49(7): 271-279. https://doi.org/10.11896/jsjkx.210600040
[7] 郭雨欣, 陈秀宏. Automatic Summarization Model Combining BERT Word Embedding Representation and Topic Information Enhancement. Computer Science, 2022, 49(6): 313-318. https://doi.org/10.11896/jsjkx.210400101
[8] 范静宇, 刘全. Off-policy Maximum Entropy Deep Reinforcement Learning Algorithm Based on Randomly Weighted Triple Q-Learning. Computer Science, 2022, 49(6): 335-341. https://doi.org/10.11896/jsjkx.210300081
[9] 谢万城, 李斌, 代玥玥. PPO Based Task Offloading Scheme in Aerial Reconfigurable Intelligent Surface-assisted Edge Computing. Computer Science, 2022, 49(6): 3-11. https://doi.org/10.11896/jsjkx.220100249
[10] 洪志理, 赖俊, 曹雷, 陈希亮, 徐志雄. Study on Intelligent Recommendation Method of Dueling Network Reinforcement Learning Based on Regret Exploration. Computer Science, 2022, 49(6): 149-157. https://doi.org/10.11896/jsjkx.210600226
[11] 张佳能, 李辉, 吴昊霖, 王壮. Exploration and Exploitation Balanced Experience Replay. Computer Science, 2022, 49(5): 179-185. https://doi.org/10.11896/jsjkx.210300084
[12] 李鹏, 易修文, 齐德康, 段哲文, 李天瑞. Heating Strategy Optimization Method Based on Deep Learning. Computer Science, 2022, 49(4): 263-268. https://doi.org/10.11896/jsjkx.210300155
[13] 周琴, 罗飞, 丁炜超, 顾春华, 郑帅. Double Speedy Q-Learning Based on Successive Over Relaxation. Computer Science, 2022, 49(3): 239-245. https://doi.org/10.11896/jsjkx.201200173
[14] 李素, 宋宝燕, 李冬, 王俊陆. Composite Blockchain Associated Event Tracing Method for Financial Activities. Computer Science, 2022, 49(3): 346-353. https://doi.org/10.11896/jsjkx.210700068
[15] 欧阳卓, 周思源, 吕勇, 谭国平, 张悦, 项亮亮. DRL-based Vehicle Control Strategy for Signal-free Intersections. Computer Science, 2022, 49(3): 46-51. https://doi.org/10.11896/jsjkx.210700010