Computer Science ›› 2019, Vol. 46 ›› Issue (5): 169-174. doi: 10.11896/j.issn.1002-137X.2019.05.026

• Artificial Intelligence •

Asynchronous Advantage Actor-Critic Algorithm with Visual Attention Mechanism

LI Jie1,2, LING Xing-hong1,2, FU Yu-chen1,2, LIU Quan1,2,3,4

  1. (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China)1
    (Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China)2
    (Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China)3
    (Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China)4
  • Received: 2018-05-10  Revised: 2018-08-11  Published: 2019-05-15
  • About the authors: LI Jie (born 1994), male, master's student; his main research interests include deep learning and deep reinforcement learning. LING Xing-hong (born 1968), male, Ph.D., associate professor, corresponding author (E-mail: lingxinghong@suda.edu.cn); his main research interests include machine learning and reinforcement learning. FU Yu-chen (born 1968), male, Ph.D., professor, senior member of CCF; his main research interests include reinforcement learning and artificial intelligence. LIU Quan (born 1969), male, Ph.D., professor and Ph.D. supervisor, senior member of CCF; his main research interests include machine learning and intelligent information processing.



Abstract: Asynchronous deep reinforcement learning (ADRL) can greatly reduce the training time required by learning models through multi-threading techniques. However, the asynchronous advantage actor-critic (A3C) algorithm, a classical ADRL algorithm, does not make full use of certain image regions of high value, so the learning efficiency of the network model is not ideal. To address this problem, this paper proposes an asynchronous advantage actor-critic model with a visual attention mechanism (VAM-A3C). VAM-A3C introduces a visual attention mechanism into the traditional asynchronous advantage actor-critic algorithm: a visual importance value is computed for each region of the image, and the context vector of the attention mechanism is obtained through regression and weighting operations, so that the agent can concentrate its attention on smaller but more informative image regions, which speeds up the decoding of the network model and allows an approximately optimal policy to be learned more efficiently. Experimental results show that, compared with the traditional asynchronous advantage actor-critic algorithm, VAM-A3C achieves better performance on decision-making tasks based on visual perception.
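As a concrete illustration of the mechanism described in the abstract, the sketch below shows a soft visual-attention head that scores CNN feature-map regions, forms the attention context vector as their softmax-weighted sum, and feeds it to actor (policy) and critic (value) outputs. It is a minimal reconstruction under stated assumptions, not the authors' released code; the class name VAMActorCritic and all layer shapes (e.g. feat_dim=64, hidden=256, a four-frame input) are illustrative choices.

```python
# Hedged sketch of an attention-augmented actor-critic network (assumed names and sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAMActorCritic(nn.Module):
    def __init__(self, in_channels=4, num_actions=6, feat_dim=64, hidden=256):
        super().__init__()
        # Convolutional encoder: turns the input frames into an L x D grid of
        # region features (L spatial regions, D channels per region).
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, feat_dim, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.attn_score = nn.Linear(feat_dim, 1)      # one "visual importance" score per region
        self.fc = nn.Linear(feat_dim, hidden)
        self.actor = nn.Linear(hidden, num_actions)   # policy logits
        self.critic = nn.Linear(hidden, 1)            # state-value estimate

    def forward(self, frames):
        f = self.conv(frames)                          # (B, D, H, W)
        B, D, H, W = f.shape
        regions = f.view(B, D, H * W).transpose(1, 2)  # (B, L, D) with L = H*W
        # Soft attention: softmax over regions gives the weights, and the
        # weighted sum of region features is the context vector.
        weights = F.softmax(self.attn_score(regions).squeeze(-1), dim=1)  # (B, L)
        context = torch.bmm(weights.unsqueeze(1), regions).squeeze(1)     # (B, D)
        h = F.relu(self.fc(context))
        return self.actor(h), self.critic(h), weights
```

In an A3C-style setup, each worker thread would sample actions from a categorical distribution over the policy logits and use the critic output as the baseline for the advantage estimate; the returned attention weights can also be visualized to check which image regions the agent attends to.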

Key words: Actor-critic, Asynchronous advantage actor-critic, Asynchronous deep reinforcement learning, Visual attention mechanism

CLC Number: TP181