Computer Science ›› 2023, Vol. 50 ›› Issue (1): 262-269. doi: 10.11896/jsjkx.220700010

• Artificial Intelligence •


Sparse Reward Exploration Method Based on Trajectory Perception

ZHANG Qiyang, CHEN Xiliang, ZHANG Qiao   

  1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Received: 2022-07-01  Revised: 2022-08-11  Online: 2023-01-15  Published: 2023-01-09
  • Corresponding author: CHEN Xiliang (383618393@qq.com)
  • About author: ZHANG Qiyang (qiyangz@foxmail.com), born in 1998, postgraduate. His main research interests include deep reinforcement learning and knowledge transfer.
    CHEN Xiliang, born in 1985, Ph.D., associate professor. His main research interests include command information system engineering and deep reinforcement learning.
  • Supported by:
    National Natural Science Foundation of China (61806221).


Abstract: When dealing with sparse reward problems, existing deep reinforcement learning algorithms often run into hard exploration: because they rely only on the pre-designed environment reward, they struggle to achieve good results. In such settings, rewards need to be designed more carefully, so that the agent's exploration state can be judged more accurately and fed back to it. The asynchronous advantage actor-critic (A3C) algorithm improves training efficiency through parallel training and speeds up the original algorithm, but in sparse-reward environments it still does not solve the hard-exploration problem well. To address the poor exploration of A3C in sparse reward environments, an A3C algorithm based on automatic perception of the exploration trajectory (exploration trajectory perception A3C, ETP-A3C) is proposed. When exploration stalls during training, the algorithm perceives the agent's exploration trajectory, further judges and decides the agent's exploration direction, and helps the agent escape the exploration dilemma as quickly as possible. To verify the effectiveness of ETP-A3C, comparative experiments against baseline algorithms are carried out in five different environments of Super Mario Bros. The results show that the proposed algorithm achieves clear improvements in both learning speed and model stability.
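The mechanism described above can be made concrete with a minimal sketch. The following Python fragment is an assumption-laden illustration only: the TrajectoryPerception class, its stall test based on how few distinct states appear in the recent window, the count-based exploration bonus, and the env/policy interfaces are all hypothetical choices used to show how a trajectory-aware signal could be mixed into an A3C-style worker loop; it is not the authors' ETP-A3C implementation.

```python
# Purely illustrative sketch: the abstract describes ETP-A3C only at a high level,
# so the stall test, the count-based bonus, and the env/policy interfaces below are
# assumptions for exposition, not the authors' actual implementation.
import numpy as np
from collections import deque

class TrajectoryPerception:
    """Tracks a window of recent (discretised) states and flags stalled exploration."""

    def __init__(self, window=200, stall_threshold=0.2):
        self.states = deque(maxlen=window)          # recent trajectory
        self.stall_threshold = stall_threshold      # distinct-state ratio below this => stuck

    def update(self, state):
        # Assumes a low-dimensional state vector; coarse rounding makes revisits countable.
        self.states.append(tuple(np.round(state, 1)))

    def is_stuck(self):
        if len(self.states) < self.states.maxlen:
            return False
        distinct_ratio = len(set(self.states)) / len(self.states)
        return distinct_ratio < self.stall_threshold

    def exploration_bonus(self, state):
        # Rarely visited states in the recent window get a larger bonus (count-based guess).
        visits = self.states.count(tuple(np.round(state, 1)))
        return 1.0 / np.sqrt(visits + 1)

def run_worker(env, policy, n_steps=5, total_steps=10_000, bonus_scale=0.1):
    """One A3C-style worker loop with the trajectory-aware bonus mixed into the reward."""
    perception = TrajectoryPerception()
    state = env.reset()
    for _ in range(total_steps // n_steps):
        rollout = []
        for _ in range(n_steps):
            action = policy.act(state)                       # hypothetical policy interface
            next_state, reward, done, _ = env.step(action)   # classic Gym-style step
            perception.update(next_state)
            if perception.is_stuck():                        # shape reward only when stuck
                reward += bonus_scale * perception.exploration_bonus(next_state)
            rollout.append((state, action, reward))
            state = env.reset() if done else next_state
        policy.update(rollout)   # in real A3C this would push gradients asynchronously
```

In this sketch the bonus is applied only while exploration looks stalled, which mirrors the abstract's idea of intervening when the agent is in an exploration dilemma rather than shaping the reward at every step.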

Key words: Artificial intelligence, Knowledge transfer, Deep reinforcement learning, Asynchronous advantage actor-critic (A3C), Exploration-exploitation problem

CLC Number: 

  • TP181