Computer Science ›› 2024, Vol. 51 ›› Issue (5): 179-192. doi: 10.11896/jsjkx.230800099

• Artificial Intelligence •

Multi-agent Reinforcement Learning Algorithm Based on AI Planning

XIN Yuanxia1, HUA Daoyang2, ZHANG Li3   

  1 School of Software Technology, Zhejiang University, Ningbo, Zhejiang 315103, China
    2 School of Physics, Zhejiang University, Hangzhou 310027, China
    3 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  • Received: 2023-08-16 Revised: 2024-01-12 Online: 2024-05-15 Published: 2024-05-08
  • About author: XIN Yuanxia, born in 2000, postgraduate. Her main research interests include artificial intelligence and multi-agent reinforcement learning.
    ZHANG Li, born in 1981, Ph.D. His main research interests include artificial intelligence, man-computer symbiosis and ubiquitous computing.

Abstract: Deep reinforcement learning algorithms have achieved impressive results in many fields. In multi-agent tasks, however, agents face a non-stationary environment with a large state-action space and sparse rewards, so low exploration efficiency remains a major challenge. AI planning can quickly derive a solution from the initial state and goal state of a task, and this solution can serve as each agent's initial policy and provide effective guidance for its exploration. This paper therefore combines the two and proposes a unified model for multi-agent reinforcement learning and AI planning (UniMP), on top of which a problem-solving mechanism is designed and implemented. The multi-agent reinforcement learning task is transformed into an intelligent decision task, and a heuristic search over it produces a set of macro goals that guide the training process of reinforcement learning, allowing the agents to explore more efficiently. Finally, experiments are conducted on various maps of the multi-agent real-time strategy game StarCraft II and on the RoboMaster AI Challenge Simulator 2D. The results show significant improvements in cumulative reward and win rate, verifying the feasibility of UniMP, the effectiveness of the solution mechanism, and the algorithm's ability to flexibly handle unexpected situations in the reinforcement learning environment.
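To make the pipeline described above concrete, the sketch below illustrates the plan-then-guide idea in Python: a heuristic (A*) search over an abstracted decision task yields a sequence of macro goals, which are then used to shape the reward during multi-agent training. This is a minimal illustration under stated assumptions, not the paper's implementation; the names (plan_macro_goals, MacroGoal, shaped_reward), the abstract-state representation (hashable tuples), and the bonus-based shaping rule are all hypothetical choices made for the example.

# Minimal sketch of the UniMP idea: heuristic search over an abstracted
# decision task produces macro goals, which then shape the RL reward.
# All names and interfaces here are hypothetical, not the paper's API.
from dataclasses import dataclass
import heapq

@dataclass(frozen=True)
class MacroGoal:
    state: tuple  # abstract state the agents should reach next

def plan_macro_goals(initial, goal, successors, heuristic):
    # A* search from the task's initial state to its goal state; the states
    # along the returned plan serve as macro goals for the RL phase.
    frontier = [(heuristic(initial, goal), 0, initial, [initial])]
    visited = set()
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return [MacroGoal(s) for s in path[1:]]  # drop the initial state
        if state in visited:
            continue
        visited.add(state)
        for nxt, step_cost in successors(state):
            if nxt not in visited:
                g = cost + step_cost
                heapq.heappush(frontier, (g + heuristic(nxt, goal), g, nxt, path + [nxt]))
    return []  # no plan found: fall back to unguided exploration

def shaped_reward(env_reward, abstract_state, macro_goals, bonus=0.5):
    # Grant a bonus whenever the agents reach the next pending macro goal,
    # so the plan steers exploration without replacing the environment reward.
    if macro_goals and abstract_state == macro_goals[0].state:
        macro_goals.pop(0)
        return env_reward + bonus
    return env_reward

# Toy usage: plan on a 1-D abstract task, then shape rewards during training.
goals = plan_macro_goals(
    initial=(0,), goal=(3,),
    successors=lambda s: [((s[0] + 1,), 1), ((s[0] - 1,), 1)],
    heuristic=lambda s, g: abs(g[0] - s[0]))
# goals == [MacroGoal((1,)), MacroGoal((2,)), MacroGoal((3,))]

In an actual training loop, shaped_reward would wrap the environment's team reward at every step, and the macro-goal list could be re-planned whenever the environment changes unexpectedly, which is one way the flexibility claimed in the abstract could be realized.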

Key words: Multi-agent reinforcement learning, AI planning, Heuristic search, Exploration efficiency

CLC Number: TP181