Computer Science ›› 2021, Vol. 48 ›› Issue (12): 297-303. doi: 10.11896/jsjkx.201000163

• Artificial Intelligence •

Proximal Policy Optimization Based on Self-directed Action Selection

SHEN Yi1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2 Jiangsu Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
  • Received: 2020-10-28 Revised: 2021-03-11 Online: 2021-12-15 Published: 2021-11-26
  • Corresponding author: LIU Quan (quanliu@suda.edu.cn)
  • About author: SHEN Yi, born in 1995, postgraduate (20184227052@stu.suda.edu.cn). Her main research interests include deep reinforcement learning.
    LIU Quan, born in 1969, Ph.D, professor, is a member of China Computer Federation. His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329), Natural Science Research Major Program of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Industrial Application of Basic Research Program (SYG201422) and a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Optimization algorithms with monotonic policy improvement are a current research hotspot in reinforcement learning, and they have achieved good performance in both discrete and continuous control tasks. Proximal policy optimization (PPO) is a classic monotonic policy improvement algorithm, but as an on-policy algorithm its sample utilization is low. To address this problem, an algorithm named proximal policy optimization based on self-directed action selection (SDAS-PPO) is proposed. SDAS-PPO not only reuses sample experience according to the importance sampling weight, but also adds a synchronously updated experience pool that stores the agent's own excellent sample experience, and uses a self-directed network learned from this pool to guide action selection. SDAS-PPO greatly improves sample utilization and ensures that the agent learns quickly and effectively when training the network model. To verify the effectiveness of SDAS-PPO, it is compared with TRPO, PPO and PPO-AMBER on continuous control tasks in the MuJoCo simulation platform. Experimental results show that the proposed method performs better in most environments.
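
The abstract describes the mechanism only at a high level: PPO reuses samples through the importance sampling weight r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) inside its clipped surrogate objective, and SDAS-PPO additionally keeps a pool of the agent's own high-return experience from which a self-directed network is trained to guide action selection. The PyTorch sketch below is a minimal illustration of one plausible reading of that description, not the authors' implementation; the names (GaussianPolicy, SelfGuide, select_action), the return threshold used to decide which experience counts as "excellent", and the mixing coefficient beta are hypothetical choices made only for this sketch.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Small diagonal-Gaussian policy for continuous control."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())

def ppo_clip_loss(policy, old_log_prob, obs, act, adv, clip_eps=0.2):
    # Clipped surrogate objective; `ratio` is the importance sampling weight.
    log_prob = policy.dist(obs).log_prob(act).sum(-1)
    ratio = torch.exp(log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

class SelfGuide:
    # Synchronously updated pool of the agent's own high-return transitions,
    # plus a small network regressed onto the stored actions (a behavior-cloning
    # style self-directed network).
    def __init__(self, obs_dim, act_dim, capacity=10000):
        self.obs, self.act, self.capacity = [], [], capacity
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, act_dim))
        self.opt = torch.optim.Adam(self.net.parameters(), lr=3e-4)

    def maybe_store(self, episode_return, obs_list, act_list, return_threshold):
        # Only episodes above a (hypothetical) running return threshold are kept.
        if episode_return >= return_threshold:
            self.obs.extend(obs_list)
            self.act.extend(act_list)
            self.obs = self.obs[-self.capacity:]
            self.act = self.act[-self.capacity:]

    def train_step(self):
        if not self.obs:
            return
        o, a = torch.stack(self.obs), torch.stack(self.act)
        loss = ((self.net(o) - a) ** 2).mean()   # regress onto the good actions
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

def select_action(policy, guide, obs, beta=0.3):
    # Blend the policy's sampled action with the guide network's suggestion;
    # beta is a hypothetical mixing coefficient, not a value from the paper.
    with torch.no_grad():
        a_pi = policy.dist(obs).sample()
        a_guide = guide.net(obs)
    return (1.0 - beta) * a_pi + beta * a_guide

In this reading, the guide network is trained only on the agent's own best trajectories, so it needs no off-policy correction; how strongly it should bias exploration (beta here) is exactly the kind of design choice the MuJoCo experiments reported in the paper would settle.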

Key words: Reinforcement learning, Deep reinforcement learning, Policy gradient, Proximal policy optimization, Self-directed

CLC Number: TP181