Computer Science ›› 2021, Vol. 48 ›› Issue (10): 37-43. doi: 10.11896/jsjkx.200900208

• Artificial Intelligence •

Deep Deterministic Policy Gradient with Episode Experience Replay

ZHANG Jian-hang1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
  • Received: 2020-09-30 Revised: 2020-12-30 Online: 2021-10-15 Published: 2021-10-18
  • Corresponding author: LIU Quan (quanliu@suda.edu.cn)
  • About author: ZHANG Jian-hang, born in 1995, postgraduate (20185227051@stu.suda.edu.cn). His main research interests include deep reinforcement learning.
    LIU Quan, born in 1969, Ph.D, professor, is a member of China Computer Federation. His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China (61772355, 61702055, 61502323, 61502329), Jiangsu Province Natural Science Research University Major Projects (18KJA520011, 17KJA520004), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), Suzhou Industrial Application of Basic Research Program Part (SYG201422) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Continuous control has been an active research topic in reinforcement learning in recent years. The deep deterministic policy gradient (DDPG) algorithm performs well in continuous control tasks. DDPG trains its network model with an experience replay mechanism. To further improve the efficiency of experience replay in DDPG, the episode cumulative reward is used as the criterion for classifying transitions, and a deep deterministic policy gradient with episode experience replay (EER-DDPG) algorithm is proposed. First, transitions are stored in units of episodes, and two replay buffers are introduced to classify the episodes according to their cumulative reward. Then, during network training, episodes with large cumulative reward are sampled preferentially, which improves the quality of the learned policy. The algorithm is evaluated experimentally on continuous control tasks and compared with the DDPG algorithm using uniform random sampling, the trust region policy optimization (TRPO) algorithm, and the proximal policy optimization (PPO) algorithm. The experimental results show that EER-DDPG achieves better performance.

Key words: Classifying experience replay, Continuous control tasks, Cumulative reward, Deep deterministic policy gradient, Experience replay
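
The episode classification and sampling scheme described in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration only, not the authors' implementation: the running-mean return threshold and the 0.8 high-return sampling ratio are assumptions made for the example. A DDPG agent would call add() after every environment step and sample() when updating its actor and critic networks.

import random
from collections import deque


class EpisodeReplayBuffer:
    """Two replay buffers whose contents are classified by episode return."""

    def __init__(self, capacity=100000, high_sample_ratio=0.8):
        self.high = deque(maxlen=capacity)  # transitions from high-return episodes
        self.low = deque(maxlen=capacity)   # transitions from low-return episodes
        self.high_sample_ratio = high_sample_ratio
        self._current_episode = []          # transitions of the episode in progress
        self._mean_return = 0.0             # running mean used as the split threshold
        self._episodes_seen = 0

    def add(self, state, action, reward, next_state, done):
        """Store one transition; classify the whole episode when it ends."""
        self._current_episode.append((state, action, reward, next_state, done))
        if done:
            episode_return = sum(t[2] for t in self._current_episode)
            self._episodes_seen += 1
            # Incrementally update the mean episode return (assumed threshold rule).
            self._mean_return += (episode_return - self._mean_return) / self._episodes_seen
            target = self.high if episode_return >= self._mean_return else self.low
            target.extend(self._current_episode)
            self._current_episode = []

    def sample(self, batch_size):
        """Draw most of the batch from high-return episodes, the rest from low-return ones."""
        n_high = min(int(batch_size * self.high_sample_ratio), len(self.high))
        n_low = min(batch_size - n_high, len(self.low))
        batch = random.sample(list(self.high), n_high) + random.sample(list(self.low), n_low)
        random.shuffle(batch)
        return batch

    def __len__(self):
        return len(self.high) + len(self.low)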

CLC Number: TP181