Computer Science (计算机科学), 2022, Vol. 49, Issue (8): 247-256. doi: 10.11896/jsjkx.210700100
史殿习1,2,4, 赵琛然1, 张耀文3, 杨绍武1, 张拥军2
SHI Dian-xi1,2,4, ZHAO Chen-ran1, ZHANG Yao-wen3, YANG Shao-wu1, ZHANG Yong-jun2
Abstract: Most current multi-agent reinforcement learning algorithms adopt the centralized-training, decentralized-execution paradigm and have achieved good results in homogeneous multi-agent systems. However, heterogeneous multi-agent systems composed of agents with different roles often suffer from the credit-assignment problem, which makes it difficult for agents to learn effective cooperative policies. To address this problem, this paper proposes an adaptive reward method for end-to-end cooperation based on multi-agent reinforcement learning, which promotes the emergence of cooperative policies among agents. First, a batch-normalization network is proposed: it models the cooperative relationships among heterogeneous agents with a graph neural network, weights key information with an attention mechanism, and fuses the resulting feature vectors via batch normalization, so that optimization and back-propagation proceed in the correct learning direction, thereby improving the generation of cooperative policies for heterogeneous agents. Second, based on the actor-critic method, a bi-level-optimized adaptive reward network is proposed, which converts sparse rewards into continuous rewards and guides agents to generate cooperative policies according to the situation on the field. Experiments comparing the proposed algorithm with mainstream multi-agent reinforcement learning algorithms show that it achieves notable improvements in cooperative-competitive ("合作-博弈") scenarios, and a visualization analysis of the policy-reward-behavior correlation further verifies its effectiveness.
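The abstract describes the batch-normalization network only at a high level: graph-based modeling of agent relations, attention-weighted key information, and batch-normalized fusion of the resulting feature vectors. The sketch below is an illustrative reconstruction of that idea, not the authors' released code; the module name AttentionFusion, the single-head scaled dot-product attention, and the layer sizes are assumptions.

```python
# Illustrative sketch (not the paper's code): attention over per-agent features
# followed by batch-normalized fusion, approximating the "batch-normalization
# network" described in the abstract. Names and sizes are assumptions.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, feat_dim: int, embed_dim: int = 64):
        super().__init__()
        self.query = nn.Linear(feat_dim, embed_dim)
        self.key = nn.Linear(feat_dim, embed_dim)
        self.value = nn.Linear(feat_dim, embed_dim)
        # Batch normalization over the fused per-agent embeddings
        self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, agent_feats: torch.Tensor) -> torch.Tensor:
        # agent_feats: (batch, n_agents, feat_dim); the cooperation graph is
        # modeled implicitly by all-pairs attention weights between agents.
        q, k, v = self.query(agent_feats), self.key(agent_feats), self.value(agent_feats)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        weights = torch.softmax(scores, dim=-1)   # (batch, n_agents, n_agents)
        fused = torch.matmul(weights, v)          # attention-weighted aggregation
        # BatchNorm1d expects (N, C); fold agents into the batch dimension.
        b, n, d = fused.shape
        return self.bn(fused.reshape(b * n, d)).reshape(b, n, d)

# Example usage: 3 heterogeneous agents with 32-dim observation features
feats = torch.randn(8, 3, 32)
print(AttentionFusion(32)(feats).shape)  # torch.Size([8, 3, 64])
```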
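The adaptive reward network is likewise described only in outline. A minimal sketch of the general idea, a learned reward that densifies a sparse team reward and feeds an actor-critic update, might look like the following; the network shapes, the surrogate outer loss that stands in for the bi-level meta-gradient, and the update schedule are all assumptions rather than the paper's specification.

```python
# Illustrative sketch of the sparse-to-continuous reward idea: a learned reward
# network adds a dense bonus to the sparse environment reward inside an
# actor-critic style update. The bi-level structure (inner critic update,
# outer reward-network update anchored to the sparse team objective) is shown
# schematically with a simple regression surrogate; not the paper's losses.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

obs_dim, act_dim = 16, 4
reward_net = RewardNet(obs_dim, act_dim)
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
inner_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
outer_opt = torch.optim.Adam(reward_net.parameters(), lr=1e-4)

obs = torch.randn(32, obs_dim)
act = torch.randn(32, act_dim)
sparse_r = torch.zeros(32)      # mostly-zero environment reward
sparse_r[-1] = 1.0              # e.g. a single terminal team reward

# Inner level: train the critic (and, in the full method, the actor) on the
# densified reward r_env + r_learned.
dense_r = sparse_r + reward_net(obs, act).detach()
value_loss = ((critic(obs).squeeze(-1) - dense_r) ** 2).mean()
inner_opt.zero_grad(); value_loss.backward(); inner_opt.step()

# Outer level: adjust the reward network so the shaped signal stays anchored
# to the sparse team objective (a regression surrogate for the meta-gradient).
outer_loss = ((reward_net(obs, act) - sparse_r) ** 2).mean()
outer_opt.zero_grad(); outer_loss.backward(); outer_opt.step()
```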