Computer Science ›› 2021, Vol. 48 ›› Issue (10): 37-43.doi: 10.11896/jsjkx.200900208

• Artificial Intelligence •

Deep Deterministic Policy Gradient with Episode Experience Replay

ZHANG Jian-hang1, LIU Quan1,2,3,4   

  1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
    3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
    4 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210000,China
  • Received:2020-09-30 Revised:2020-12-30 Online:2021-10-15 Published:2021-10-18
  • About author:ZHANG Jian-hang,born in 1995,postgraduate.His main research interests include deep reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,is a member of China Computer Federation.His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055,61502323,61502329),Jiangsu Province Natural Science Research University Major Projects(18KJA520011,17KJA520004),Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University(93K172014K04,93K172017K18),Suzhou Industrial Application of Basic Research Program Part(SYG201422) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Abstract: Continuous control has been a hot research topic in reinforcement learning in recent years.The deep deterministic policy gradient (DDPG) algorithm performs well in continuous control tasks.DDPG trains its network model with an experience replay mechanism.To further improve the efficiency of this mechanism,a deep deterministic policy gradient with episodic experience replay (EER-DDPG) algorithm is proposed,which uses the cumulative reward as the basis for classifying transitions.First,transitions are stored in units of episodes,and two replay buffers are introduced to classify the episodes according to their cumulative reward.Then,the quality of the policy is improved during network training by randomly sampling episodes with large cumulative reward.The algorithm is verified by experiments on continuous control tasks and compared with the DDPG algorithm,the trust region policy optimization (TRPO) algorithm and the proximal policy optimization (PPO) algorithm.The experimental results show that EER-DDPG achieves better performance.
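The episodic, reward-classified replay described in the abstract can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the abstract does not specify how episodes are split between the two buffers or with what probability the high-reward buffer is sampled, so the running-mean-return threshold and the `p_high` sampling probability below are assumptions.

```python
import random
from collections import deque


class EpisodeReplayBuffer:
    """Two replay buffers storing whole episodes, classified by
    cumulative reward (episode return), as described in EER-DDPG."""

    def __init__(self, capacity=1000):
        self.high = deque(maxlen=capacity)  # episodes with large return
        self.low = deque(maxlen=capacity)   # remaining episodes
        self.mean_return = 0.0              # running mean used as threshold (assumption)
        self.count = 0

    def store_episode(self, transitions):
        # transitions: list of (state, action, reward, next_state, done) tuples
        episode_return = sum(t[2] for t in transitions)
        self.count += 1
        self.mean_return += (episode_return - self.mean_return) / self.count
        if episode_return >= self.mean_return:
            self.high.append(transitions)
        else:
            self.low.append(transitions)

    def sample(self, batch_size, p_high=0.8):
        # Prefer episodes with large cumulative reward when both buffers
        # hold data; otherwise fall back to whichever buffer is non-empty.
        use_high = self.high and (not self.low or random.random() < p_high)
        buf = self.high if use_high else self.low
        transitions = [t for episode in buf for t in episode]
        return random.sample(transitions, min(batch_size, len(transitions)))


buffer = EpisodeReplayBuffer()
buffer.store_episode([(0, 0, 1.0, 1, False), (1, 0, 2.0, 2, True)])  # return 3.0
buffer.store_episode([(0, 1, 0.0, 1, True)])                         # return 0.0
batch = buffer.sample(batch_size=2)
```

After the two calls above, the first episode (return 3.0, at or above the running mean) lands in the high-reward buffer and the second (return 0.0, below the updated mean of 1.5) in the low-reward one, so sampling is biased toward the better episode's transitions.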

Key words: Classifying experience replay, Continuous control tasks, Cumulative reward, Deep deterministic policy gradient, Experience replay

CLC Number: TP181
[1]DORPINGHAUS M,ROLDAN E,NERI I,et al.An information theoretic analysis of sequential decision-making[C]//International Symposium on Information Theory (ISIT).IEEE,2017:3050-3054.
[2]QIN Z H,LI N,LIU X T,et al.Overview of Research on Model-free Reinforcement Learning[J].Computer Science,2021,48(3):180-187.
[3]SUTTON R S,MCALLESTER D A,SINGH S P,et al.Policy gradient methods for reinforcement learning with function approximation[C]//Advances in Neural Information Processing Systems.2000:1057-1063.
[4]LECUN Y,BENGIO Y,HINTON G.Deep learning[J].Nature,2015,521(7553):436-444.
[5]TORRADO R R,BONTRAGER P,TOGELIUS J,et al.Deep reinforcement learning for general video game AI[C]//Conference on Computational Intelligence and Games (CIG).IEEE,2018:1-8.
[6]KRETZSCHMAR H,SPIES M,SPRUNK C,et al.Socially compliant mobile robot navigation via inverse reinforcement learning[J].The International Journal of Robotics Research,2016,35(11):1289-1307.
[7]LAMPLE G,CHAPLOT D S.Playing FPS games with deep reinforcement learning[C]//AAAI Conference on Artificial Intelligence.2017:2140-2146.
[8]ZHAO X,ZHANG L,DING Z,et al.Recommendations with negative feedback via pairwise deep reinforcement learning[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2018:1040-1048.
[9]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529-533.
[10]LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[J].arXiv:1509.02971,2015.
[11]SCHMIDHUBER J.Deep learning in neural networks:An overview[J].Neural Networks,2015,61:85-117.
[12]BAI C J,LIU P,ZHAO W,et al.Active Sampling for Deep Q-Learning Based on TD-error Adaptive Correction[J].Journal of Computer Science & Information Systems,2019,56(2):262-280.
[13]SCHAUL T,QUAN J,ANTONOGLOU I,et al.Prioritized experience replay[J].arXiv:1511.05952,2015.
[14]SCHULMAN J,LEVINE S,ABBEEL P,et al.Trust region policy optimization[C]//International Conference on Machine Learning.2015:1889-1897.
[15]SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347,2017.
[16]LEVIN E,PIERACCINI R,ECKERT W.Using Markov decision process for learning dialogue strategies[C]//Proceedings of the 1998 IEEE International Conference on Acoustics,Speech and Signal Processing.1998:201-204.
[17]GRONDMAN I,BUSONIU L,LOPES G A D,et al.A survey of actor-critic reinforcement learning:standard and natural policy gradients[J].IEEE Transactions on Systems,Man,and Cybernetics,Part C (Applications and Reviews),2012,42(6):1291-1307.
[18]SILVER D,LEVER G,HEESS N,et al.Deterministic policy gradient algorithms[C]//Proceedings of the International Conference on Machine Learning.2014:387-395.
[19]UHLENBECK G E,ORNSTEIN L S.On the theory of the Brownian motion[J].Physical Review,1930,36(5):823.
[20]NOVATI G,KOUMOUTSAKOS P.Remember and forget for experience replay[C]//International Conference on Machine Learning.2019:4851-4860.
[21]ZHAO Y N,LIU P,ZHAO W,et al.Twice Sampling Method in Deep Q-Network[J].Acta Automatica Sinica,2019,45(10):1870-1882.