Computer Science ›› 2022, Vol. 49 ›› Issue (8): 247-256. doi: 10.11896/jsjkx.210700100

• Artificial Intelligence •


Adaptive Reward Method for End-to-End Cooperation Based on Multi-agent Reinforcement Learning

SHI Dian-xi1,2,4, ZHAO Chen-ran1, ZHANG Yao-wen3, YANG Shao-wu1, ZHANG Yong-jun2   

  1. School of Computer Science, National University of Defense Technology, Changsha 410073, China
    2. National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100166, China
    3. Unit 32282 of People’s Liberation Army of China, Jinan 250000, China
    4. Tianjin Artificial Intelligence Innovation Center, Tianjin 300457, China
  • Received: 2021-07-09  Revised: 2022-01-05  Published: 2022-08-02
  • Corresponding author: ZHANG Yong-jun (yjzhang@nudt.edu.cn)
  • About author: SHI Dian-xi (dxshi@nudt.edu.cn), born in 1966, Ph.D, professor, Ph.D supervisor. His main research interests include distributed object middleware technology, adaptive software technology, artificial intelligence and robot operating systems.
    ZHANG Yong-jun, born in 1966, Ph.D, professor. His main research interests include artificial intelligence, multi-agent cooperation, machine learning and feature recognition.
  • Supported by:
    National Natural Science Foundation of China (91948303).


Abstract: At present, most multi-agent reinforcement learning (MARL) algorithms adopt the centralized training with decentralized execution (CTDE) architecture and achieve good results in homogeneous multi-agent systems. However, heterogeneous multi-agent systems composed of different roles often suffer from the credit assignment problem, which makes it difficult for agents to learn effective cooperation strategies. To tackle this problem, an adaptive reward method for end-to-end cooperation based on multi-agent reinforcement learning is proposed, which promotes the emergence of cooperative strategies among agents. First, a batch regularization network is proposed. It models the cooperative relationships among heterogeneous agents with a graph neural network, weights key information with an attention mechanism, and fuses the resulting feature vectors with batch regularization, so that optimization and back-propagation proceed in the correct learning direction, thereby effectively improving the quality of the cooperative strategies generated for heterogeneous agents. Second, a bi-level optimized adaptive intrinsic reward network based on the actor-critic method is proposed. It converts sparse rewards into dense rewards and guides agents to generate cooperative strategies according to the situation on the field. Experiments comparing the proposed method with current mainstream multi-agent reinforcement learning algorithms show that it achieves significantly better results in the “cooperative-game” scenario, and a visual analysis of the strategy-reward-behavior correlation further verifies its effectiveness.
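
For concreteness, the sketch below illustrates in PyTorch the kind of graph-attention feature fusion with batch regularization that the abstract describes. It is a minimal reading of the text, not the authors' released implementation: the class name GraphAttentionFusion, the single attention head, the fully connected cooperation graph, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of the batch-regularization
# network: per-agent observations are embedded, a graph attention layer weights
# information exchanged between heterogeneous agents, and batch normalization
# fuses the resulting feature vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionFusion(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden_dim)          # per-agent observation encoder
        self.query = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.key = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.value = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.norm = nn.BatchNorm1d(hidden_dim)                # "batch regularization" fusion step

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, n_agents, obs_dim)
        batch, n_agents, _ = obs.shape
        h = F.relu(self.embed(obs))                           # (batch, n_agents, hidden)
        q, k, v = self.query(h), self.key(h), self.value(h)
        # attention weights over the (fully connected) agent cooperation graph
        scores = torch.matmul(q, k.transpose(-2, -1)) / (h.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1)                  # (batch, n_agents, n_agents)
        fused = torch.matmul(attn, v)                         # aggregate neighbours' features
        # BatchNorm1d expects (N, C); fold the agent axis into the batch axis
        fused = self.norm(fused.reshape(batch * n_agents, -1))
        return fused.reshape(batch, n_agents, -1)


if __name__ == "__main__":
    net = GraphAttentionFusion(obs_dim=16)
    feats = net(torch.randn(32, 3, 16))   # 32 samples, 3 heterogeneous agents
    print(feats.shape)                    # torch.Size([32, 3, 64])
```

Under the CTDE architecture mentioned above, the fused per-agent features would presumably feed the individual actors and the centralized critic; that wiring is omitted here.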

Key words: Adaptive intrinsic reward, Graph attention network, Multi-agent reinforcement learning
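
The adaptive intrinsic reward listed among the key words can likewise be pictured as a small learned module that turns the sparse environment reward into a dense training signal, trained in a bi-level fashion alongside the actor-critic learner. The sketch below is an assumption-laden illustration only: the network shape, the shaped_reward helper, and the scale factor are not taken from the paper.

```python
# Minimal sketch (assumptions, not the paper's exact formulation) of an adaptive
# intrinsic reward network: it maps each agent's observation-action pair to a
# dense intrinsic reward, which is added to the sparse environment reward used
# to train the critic. In the bi-level scheme described in the abstract, this
# module would in turn be updated so that the extrinsic return improves; that
# outer update is only indicated in comments.
import torch
import torch.nn as nn


class AdaptiveIntrinsicReward(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Tanh(),                      # keep the dense reward bounded
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def shaped_reward(extrinsic: torch.Tensor, intrinsic: torch.Tensor,
                  scale: float = 0.1) -> torch.Tensor:
    # Dense training signal: sparse environment reward plus a scaled intrinsic term.
    return extrinsic + scale * intrinsic


if __name__ == "__main__":
    reward_net = AdaptiveIntrinsicReward(obs_dim=16, act_dim=5)
    obs, act = torch.randn(8, 16), torch.randn(8, 5)
    r_in = reward_net(obs, act)                       # dense, learned reward
    r_total = shaped_reward(torch.zeros(8), r_in)     # extrinsic reward is sparse (often 0)
    # Inner loop: train actor/critic on r_total; outer loop: update reward_net
    # so that the extrinsic return improves (bi-level optimization, sketched only).
    print(r_total.shape)                              # torch.Size([8])
```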

CLC Number: TP391