Computer Science ›› 2022, Vol. 49 ›› Issue (5): 179-185. doi: 10.11896/jsjkx.210300084

• Artificial Intelligence •

Exploration and Exploitation Balanced Experience Replay

ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang   

  1. College of Computer Science,Sichuan University,Chengdu 610065,China
  • Received:2021-03-08 Revised:2021-08-11 Online:2022-05-15 Published:2022-05-06
  • About author:ZHANG Jia-neng,born in 1997,postgraduate.His main research interests include deep reinforcement learning.
    LI Hui,born in 1970,Ph.D,professor.His main research interests include computational intelligence,battlefield simulation and virtual reality.
  • Supported by:
    Pre-research Fund of Weapons and Equipment of China(31505550302).

Abstract: Experience replay reuses past experience to update the target policy and improves sample utilization,and it has become an important component of deep reinforcement learning.Prioritized experience replay builds on experience replay by sampling selectively so that samples are used more efficiently.However,current prioritized experience replay methods reduce the diversity of the samples drawn from the experience buffer,which can cause the neural network to converge to a local optimum.To tackle this issue,a novel method named exploration and exploitation balanced experience replay (E3R) is proposed.E3R jointly considers the exploration utility and the exploitation utility of each sample and samples according to a weighted sum of two similarities:the similarity between the action taken by the behavior policy and the action the target policy would take in the same state,and the similarity between the current state and the stored past state.In addition,E3R is combined with the policy-gradient algorithm soft actor-critic and the value-based algorithm deep Q-learning,and experiments are carried out on a suite of OpenAI Gym tasks.Experimental results show that,compared with uniform random sampling and temporal-difference-error-based priority sampling,E3R achieves faster convergence and higher cumulative return.
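
To make the sampling rule concrete, the following is a minimal Python sketch of a replay buffer that draws samples according to a weighted sum of the two similarities described above. It is an illustration only: the class name E3RBuffer, the cosine-similarity metric, the mixing weight lam and the target_policy callable are assumptions made for the sketch, not the paper's actual definitions.

import numpy as np

class E3RBuffer:
    """Replay buffer sketch: priorities are a weighted sum of two similarities."""

    def __init__(self, capacity, lam=0.5):
        self.capacity = capacity   # maximum number of stored transitions
        self.lam = lam             # assumed weight balancing the two utilities
        self.buffer = []           # list of (state, action, reward, next_state) tuples

    def add(self, state, action, reward, next_state):
        # FIFO eviction once the buffer is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append((np.asarray(state, dtype=float),
                            np.asarray(action, dtype=float),
                            float(reward),
                            np.asarray(next_state, dtype=float)))

    @staticmethod
    def _sim(u, v):
        # Cosine similarity rescaled to [0, 1]; the concrete metric is an assumption.
        denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-8
        return 0.5 * (1.0 + float(np.dot(u, v)) / denom)

    def sample(self, batch_size, current_state, target_policy):
        # Priority of a stored transition (s, a):
        #   lam * sim(a, target_policy(s))        -> exploitation-related term
        # + (1 - lam) * sim(s, current_state)     -> exploration-related term
        # This mirrors the abstract's description; the paper's exact formulas may differ.
        current_state = np.asarray(current_state, dtype=float)
        priorities = np.array([
            self.lam * self._sim(a, target_policy(s))
            + (1.0 - self.lam) * self._sim(s, current_state)
            for s, a, _, _ in self.buffer
        ])
        probs = priorities / priorities.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx]

In an actual SAC or DQN agent, target_policy(s) would be the current network's (mean or greedy) action for state s, and the sampled batch would feed the usual critic and policy updates.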

Key words: Experience replay, Exploitation, Exploration, Priority sampling, Reinforcement learning, Soft actor-critic algorithm

CLC Number: TP181

[1]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529-533.
[2]SILVER D,SCHRITTWIESER J,SIMONYAN K,et al.Mastering the game of Go without human knowledge[J].Nature,2017,550(7676):354-359.
[3]KOBER J,BAGNELL J A,PETERS J.Reinforcement learning in robotics:A survey[J].The International Journal of Robotics Research,2013,32(11):1238-1274.
[4]GREGURIĆ M,VUJIĆ M,ALEXOPOULOS C,et al.Application of Deep Reinforcement Learning in Traffic Signal Control:An Overview and Impact of Open Traffic Data[J].Applied Sciences,2020,10(11):4011-4036.
[5]SCHAUL T,QUAN J,ANTONOGLOU I,et al.Prioritized Experience Replay[C]//International Conference on Learning Representations.2016.
[6]LIN L J.Self-improving reactive agents based on reinforcement learning,planning and teaching[J].Machine Learning,1992,8(3/4):293-321.
[7]ZHAO Y N,LIU P,ZHAO W,et al.Twice Sampling Method in Deep Q-network[J].Acta Automatica Sinica,2019,45(10):1870-1882.
[8]CAO X,WAN H,LIN Y,et al.High-Value Prioritized Experience Replay for Off-Policy Reinforcement Learning[C]//2019 IEEE 31st International Conference on Tools with Artificial Intelligence.IEEE,2019:1510-1514.
[9]ZHU F,WU W,LIU Q,et al.A Deep Q-Network Method Based on Upper Confidence Bound Experience Sampling[J].Journal of Computer Research and Development,2018,55(8):1694-1705.
[10]NOVATI G,KOUMOUTSAKOS P.Remember and forget for experience replay[C]//International Conference on Machine Learning.2019:4851-4860.
[11]SUN P,ZHOU W,LI H.Attentive Experience Replay[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:5900-5907.
[12]BU F,CHANG D E.Double Prioritized State Recycled Experience Replay[C]//IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia).2020:1-6.
[13]BRUIN T D,KOBER J,TUYLS K,et al.Experience Selection in Deep Reinforcement Learning for Control[J].Journal of Machine Learning Research,2018,19:1-56.
[14]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym[EB/OL].https://arxiv.org/abs/1606.01540.
[15]SUTTON R,BARTO A.Reinforcement learning:An introduction[M].Massachusetts:MIT Press,2018.
[16]LIU Q,ZHAI J W,ZHANG Z C,et al.A Survey on Deep Reinforcement Learning[J].Chinese Journal of Computers,2018,41(1):1-27.
[17]HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft Actor-Critic:Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor[C]//International Conference on Machine Learning.2018:1861-1870.
[18]WU H L,CAI L C,GAO X.Online pheromone stringency guiding heuristically accelerated Q-learning[J].Application Research of Computers,2018,35(8):2323-2327.
[19]HUANG Z Y,WU H L,WANG Z,et al.DQN Algorithm Based on Averaged Neural Network Parameters[J].Computer Science,2021,48(4):223-228.
[20]TODOROV E,EREZ T,TASSA Y.Mujoco:A physics engine for model-based control[C]//International Conference on Intelligent Robots and Systems.2012:5026-5033.