Computer Science (计算机科学), 2021, Vol. 48, Issue (10): 37-43. doi: 10.11896/jsjkx.200900208
ZHANG Jian-hang (张建行)1, LIU Quan (刘全)1,2,3,4
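Abstract: Continuous control in reinforcement learning has been an active research topic in recent years. The Deep Deterministic Policy Gradient (DDPG) algorithm performs well on continuous control tasks and trains its network models with an experience replay mechanism. To make experience replay in DDPG more efficient, this paper uses the cumulative return of each episode as the criterion for classifying samples and proposes Deep Deterministic Policy Gradient with Episode Experience Replay (EER-DDPG). First, experience samples are stored episode by episode and sorted into two replay buffers according to the cumulative return of the episode they come from. Then, during network training, sampling favors samples from episodes with larger cumulative return in order to improve training quality. The method is evaluated on continuous control tasks and compared with DDPG using random sampling, Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO). The experimental results show that EER-DDPG performs better.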
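The two-buffer scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the return threshold is chosen or how strongly sampling favors the high-return buffer, so the running-mean threshold, the `high_ratio` parameter, and the class name `EpisodeReplay` are all assumptions made for this example.

```python
import random
from collections import deque


class EpisodeReplay:
    """Two replay pools split by episode return (illustrative sketch only)."""

    def __init__(self, capacity=100_000, high_ratio=0.75, seed=None):
        self.high = deque(maxlen=capacity)  # transitions from high-return episodes
        self.low = deque(maxlen=capacity)   # transitions from all other episodes
        self.high_ratio = high_ratio        # ASSUMED share of a batch drawn from `high`
        self.mean_return = 0.0              # ASSUMED split rule: running mean of returns
        self.episodes = 0
        self.rng = random.Random(seed)

    def add_episode(self, transitions):
        """Store one finished episode; each transition is (s, a, r, s_next, done)."""
        episode_return = sum(t[2] for t in transitions)
        # Update the running mean first, then classify the whole episode against it.
        self.episodes += 1
        self.mean_return += (episode_return - self.mean_return) / self.episodes
        pool = self.high if episode_return >= self.mean_return else self.low
        pool.extend(transitions)
        return episode_return

    def sample(self, batch_size):
        """Draw a batch biased toward the high-return pool.

        The batch may be shorter than batch_size while a pool is still small.
        """
        n_high = min(len(self.high), round(batch_size * self.high_ratio))
        n_low = min(len(self.low), batch_size - n_high)
        batch = self.rng.sample(list(self.high), n_high)
        batch += self.rng.sample(list(self.low), n_low)
        self.rng.shuffle(batch)
        return batch
```

In a DDPG training loop, `add_episode` would be called once per finished episode and `sample` once per gradient step, with the returned transitions feeding the usual critic and actor updates. Setting `high_ratio=0.5` with equally sized pools roughly corresponds to not favoring either pool.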