计算机科学 (Computer Science), 2022, Vol. 49, Issue (5): 179-185. doi: 10.11896/jsjkx.210300084
张佳能, 李辉, 吴昊霖, 王壮
ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang
Abstract: Experience replay reuses past experiences to update the target policy and improves sample efficiency, and it has become an essential component of deep reinforcement learning. Prioritized experience replay builds on experience replay by sampling selectively, in the hope of making better use of experience samples. However, existing prioritized experience replay schemes reduce the diversity of the samples drawn from the replay buffer, causing the neural network to converge to a local optimum. To address this problem, a prioritized experience replay method that balances exploration and exploitation, called Exploration and Exploitation Balanced Experience Replay (E3R), is proposed. The method jointly considers the exploration utility and the exploitation utility of each sample, and samples transitions according to both the similarity between the current state and a past state and the similarity between the actions taken by the behavior policy and the target policy in the same state. In addition, E3R is combined with the policy-gradient-based Soft Actor-Critic algorithm and the value-based Deep Q-Network algorithm, and experiments are conducted in the corresponding OpenAI Gym environments. The results show that, compared with conventional random sampling and temporal-difference prioritized sampling, E3R achieves faster convergence and higher cumulative returns.
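To make the sampling rule concrete, the following is a minimal Python sketch of how a replay buffer might combine the two similarity terms described above into sampling probabilities. The class name E3RBuffer, the cosine-similarity measure, and the alpha trade-off coefficient are illustrative assumptions; the paper's exact priority formula is not reproduced here.

```python
import numpy as np


class E3RBuffer:
    """Illustrative replay buffer that weights each stored transition by
    (1) similarity between its state and the agent's current state, and
    (2) dissimilarity between its stored action and the action the current
        target policy would take in that state.
    The weighting scheme and hyper-parameters are assumptions for
    illustration, not the formulation published in the paper."""

    def __init__(self, capacity, alpha=0.5):
        self.capacity = capacity
        self.alpha = alpha     # hypothetical trade-off between the two terms
        self.storage = []      # tuples (state, action, reward, next_state, done)
        self.pos = 0

    def add(self, transition):
        # Overwrite oldest transition once the buffer is full.
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    @staticmethod
    def _similarity(a, b):
        # Cosine similarity mapped to [0, 1]; a stand-in for whatever
        # similarity measure the method actually uses.
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
        return 0.5 * (1.0 + float(np.dot(a, b) / denom))

    def sample(self, batch_size, current_state, target_policy):
        """target_policy(state) -> action proposed by the current target policy."""
        weights = []
        for (s, a, *_rest) in self.storage:
            exploit = self._similarity(s, current_state)           # state similarity
            explore = 1.0 - self._similarity(a, target_policy(s))  # action dissimilarity
            weights.append(self.alpha * exploit + (1.0 - self.alpha) * explore)
        probs = np.asarray(weights) / np.sum(weights)
        idx = np.random.choice(len(self.storage), size=batch_size, p=probs)
        return [self.storage[i] for i in idx]
```

In an ordinary SAC or DQN training loop, this buffer's sample call would replace uniform sampling, with current_state taken as the agent's most recent observation and target_policy as the network being trained.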