Computer Science, 2023, Vol. 50, Issue (1): 262-269. doi: 10.11896/jsjkx.220700010
ZHANG Qiyang, CHEN Xiliang, ZHANG Qiao
Abstract: Existing deep reinforcement learning algorithms often run into exploration difficulty on sparse-reward problems: they rely solely on pre-designed environment rewards and therefore struggle to achieve good results. In such settings, rewards need to be designed more carefully, and the agent's exploration state needs to be judged and fed back more precisely. The Asynchronous Advantage Actor-Critic (A3C) algorithm improves training efficiency through parallel training and thus accelerates the original actor-critic algorithm, but it still does not handle exploration difficulty well in sparse-reward environments. To address A3C's poor exploration in sparse-reward environments, this paper proposes an A3C algorithm based on automatic perception of exploration trajectories (Exploration Trajectory Perception A3C, ETP-A3C). When exploration stalls during training, the algorithm perceives the agent's exploration trajectory, then judges and decides the agent's exploration direction, helping the agent escape the exploration dilemma as quickly as possible. To verify the effectiveness of ETP-A3C, it was compared against baseline algorithms in five different Super Mario Bros environments; the results show that the proposed algorithm yields clear improvements in both learning speed and model stability.
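The core idea of the abstract, detecting when an agent's recent trajectory indicates stalled exploration, can be illustrated with a minimal sketch. The window size, the novelty measure (ratio of distinct states in a recent window), and the threshold below are all hypothetical illustrations, not the paper's actual mechanism; ETP-A3C's trajectory-perception module is described here only at the level given in the abstract.

```python
from collections import deque


class TrajectoryPerception:
    """Hypothetical sketch of trajectory perception: watch a sliding
    window of recently visited states and flag the agent as 'stuck'
    when the trajectory stops reaching new states."""

    def __init__(self, window_size=50, novelty_threshold=0.2):
        # Keep only the most recent `window_size` states.
        self.window = deque(maxlen=window_size)
        self.novelty_threshold = novelty_threshold

    def observe(self, state):
        """Record one (hashable) state from the agent's trajectory."""
        self.window.append(state)

    def is_stuck(self):
        """Exploration counts as stalled once the window is full and
        the fraction of distinct states falls below the threshold."""
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        novelty = len(set(self.window)) / len(self.window)
        return novelty < self.novelty_threshold
```

In a full A3C setup, each worker could feed its states into such a monitor and, when `is_stuck()` fires, trigger whatever corrective signal the algorithm uses (e.g., an intrinsic bonus or a change of exploration direction); those details are not specified in the abstract.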