Computer Science ›› 2024, Vol. 51 ›› Issue (5): 179-192. doi: 10.11896/jsjkx.230800099

• Artificial Intelligence •

  • Corresponding author: ZHANG Li (zhangli85@zju.edu.cn)
  • About author: (yxxin@zju.edu.cn)

Multi-agent Reinforcement Learning Algorithm Based on AI Planning

XIN Yuanxia1, HUA Daoyang2, ZHANG Li3   

  1 School of Software Technology, Zhejiang University, Ningbo, Zhejiang 315103, China
    2 School of Physics, Zhejiang University, Hangzhou 310027, China
    3 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  • Received: 2023-08-16 Revised: 2024-01-12 Online: 2024-05-15 Published: 2024-05-08
  • About author: XIN Yuanxia, born in 2000, postgraduate. Her main research interests include artificial intelligence and multi-agent reinforcement learning.
    ZHANG Li, born in 1981, Ph.D. His main research interests include artificial intelligence, man-computer symbiosis and ubiquitous computing.



Abstract: Deep reinforcement learning algorithms have achieved strong results in many fields. In multi-agent tasks, however, agents often face a non-stationary environment with a large state-action space and sparse rewards, and low exploration efficiency remains a major challenge. Since AI planning can quickly derive a solution from the initial state and goal state of a task, and this solution can serve as each agent's initial policy and provide effective guidance for its exploration, this work combines the two approaches and proposes UniMP, a unified model for multi-agent reinforcement learning and AI planning, together with a corresponding problem-solving mechanism. The multi-agent reinforcement learning task is first transformed into an intelligent decision task; heuristic search over this task then yields a set of macro goals that guide the reinforcement learning training, allowing the agents to explore more efficiently. Experiments on various maps of the multi-agent real-time strategy game StarCraft II and on the RoboMaster AI Challenge Simulator 2D show that both the cumulative reward and the win rate improve significantly, verifying the feasibility of UniMP, the effectiveness of the solving mechanism, and the algorithm's ability to handle unexpected situations in the reinforcement learning environment flexibly.
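The pipeline the abstract describes (plan from the initial and goal states, extract a set of macro goals, then use those goals to guide exploration) can be sketched in miniature. The grid world, the greedy best-first planner, the macro-goal step size, and the shaping bonus below are all illustrative assumptions for a single agent, not the paper's UniMP implementation:

```python
import heapq

GRID = 5  # hypothetical 5x5 grid world standing in for the task's state space

def heuristic(state, goal):
    """Manhattan distance: the planner's heuristic estimate of cost-to-go."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])

def plan_macro_goals(start, goal, step=2):
    """Greedy best-first search from start to goal; every `step`-th state on
    the resulting plan becomes a macro goal handed to the RL learner."""
    frontier = [(heuristic(start, goal), start, [start])]
    visited = {start}
    while frontier:
        _, state, path = heapq.heappop(frontier)
        if state == goal:
            macro = path[step::step]
            if not macro or macro[-1] != goal:
                macro.append(goal)
            return macro
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (state[0] + dx, state[1] + dy)
            if 0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID and nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt, goal), nxt, path + [nxt]))
    return []  # no plan found

def shaped_reward(env_reward, state, macro_goals, idx):
    """Reward shaping: add a bonus when the agent reaches its current macro
    goal, then advance to the next one. Returns (reward, new goal index)."""
    if idx < len(macro_goals) and state == macro_goals[idx]:
        return env_reward + 1.0, idx + 1
    return env_reward, idx
```

During training, the shaped reward densifies the otherwise sparse environment signal, which is how the macro goals steer exploration.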

Key words: Multi-agent reinforcement learning, AI planning, Heuristic search, Exploration efficiency

CLC number: TP181