Computer Science ›› 2024, Vol. 51 ›› Issue (1): 301-309. doi: 10.11896/jsjkx.230500146

• Artificial Intelligence •


Curriculum Learning Framework Based on Reinforcement Learning in Sparse Heterogeneous Multi-agent Environments

LUO Ruiqing1, ZENG Kun1, ZHANG Xinjing2   

  1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
    2. Unit 91976, People's Liberation Army of China, Guangzhou 510430, China
  • Received: 2023-05-22  Revised: 2023-09-20  Online: 2024-01-15  Published: 2024-01-12
  • Corresponding author: ZENG Kun (zengkun2@mail.sysu.edu.cn)
  • About author: LUO Ruiqing, born in 1995, postgraduate (ruiqingluo1@163.com). His main research interests include machine learning and reinforcement learning. ZENG Kun, born in 1982, Ph.D, associate professor. His main research interests include computer vision, machine learning, and graphics.
  • Supported by:
    National Natural Science Foundation of China (U1711266) and Guangdong Basic and Applied Basic Research Foundation (2019A1515011078).


Abstract: The battlefield of modern warfare is large and involves many types of combat units, and applying multi-agent reinforcement learning (MARL) to battlefield simulation can strengthen collaborative decision-making among combat units and thus improve combat effectiveness. Current applications of MARL in wargaming research and adversarial exercises generally rely on two simplifications: the homogeneity of agents and the dense distribution of combat units. Real-world warfare scenarios do not always satisfy these assumptions and may involve various heterogeneous agents as well as sparsely distributed combat units. To explore the application of reinforcement learning in a wider range of scenarios, this paper studies improvements in both aspects. Firstly, a multi-scale multi-agent amphibious landing environment (M2ALE) is designed and implemented to address the two simplifications, incorporating various heterogeneous agents and scenarios with sparsely distributed combat units. These complex settings exacerbate the exploration difficulty and non-stationarity of multi-agent environments, making them hard to train with commonly used multi-agent algorithms. Secondly, a heterogeneous multi-agent curriculum learning framework (HMACL) is proposed to address the challenges of the M2ALE environment. HMACL consists of three modules: a source task generating (STG) module, a class policy improving (CPI) module, and a Trainer module. The STG module generates source tasks to guide agent training; the CPI module uses a class-based parameter sharing strategy to mitigate the non-stationarity of the multi-agent system and to implement parameter sharing in a heterogeneous agent system; and the Trainer module fetches source tasks from the STG and the latest policies from the CPI, and trains them with an arbitrary MARL algorithm. HMACL alleviates the exploration difficulty and non-stationarity issues of commonly used MARL algorithms in the M2ALE environment and guides the learning process of the multi-agent system. Experiments show that HMACL significantly improves the sampling efficiency and final performance of MARL algorithms in the M2ALE environment.
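The abstract describes the HMACL architecture only at a high level. Purely as an orientation aid, the following standard-library Python sketch shows how the three modules could interact: the STG proposes progressively harder source tasks, the CPI keeps one shared policy per agent class (class-based parameter sharing), and the Trainer runs an arbitrary MARL update on the current source task. All class names, signatures, the difficulty schedule, and the dummy update rule are illustrative assumptions and are not taken from the paper.

```python
# Minimal, illustrative sketch of the HMACL loop described in the abstract.
# Every name, signature, and numeric choice below is an assumption made for
# illustration; the paper's actual interfaces and algorithms differ.
import random
from collections import defaultdict

class SourceTaskGenerator:
    """STG module: proposes easier source tasks that gradually approach
    the sparse, heterogeneous target task."""
    def __init__(self, target_task):
        self.target_task = target_task

    def sample(self, progress):
        # Interpolate task difficulty with training progress in [0, 1]:
        # start with densely distributed units and few agent classes.
        density = 1.0 - 0.8 * progress
        n_classes = 1 + int(progress * (self.target_task["n_classes"] - 1))
        return {"unit_density": density, "n_classes": n_classes}

class ClassPolicyPool:
    """CPI module: class-based parameter sharing -- one policy per agent
    class, shared by all agents of that class."""
    def __init__(self, agent_classes):
        self.policies = {c: {"params": 0.0} for c in agent_classes}

    def policy_for(self, agent_class):
        return self.policies[agent_class]

    def improve(self, agent_class, gradient):
        self.policies[agent_class]["params"] += gradient

class Trainer:
    """Trainer module: runs any MARL algorithm on the current source task
    with the latest class policies (a dummy update stands in here)."""
    def train_step(self, task, pool, agents):
        returns = defaultdict(list)
        for agent_id, agent_class in agents.items():
            _ = pool.policy_for(agent_class)  # act with the shared class policy
            returns[agent_class].append(random.random() * task["unit_density"])
        for agent_class, rs in returns.items():
            pool.improve(agent_class, sum(rs) / len(rs) * 0.01)
        return {c: sum(rs) / len(rs) for c, rs in returns.items()}

if __name__ == "__main__":
    agents = {"tank_1": "tank", "tank_2": "tank", "ship_1": "ship", "uav_1": "uav"}
    stg = SourceTaskGenerator(target_task={"n_classes": 3})
    cpi = ClassPolicyPool(set(agents.values()))
    trainer = Trainer()
    total_steps = 10
    for step in range(total_steps):
        task = stg.sample(progress=step / (total_steps - 1))
        stats = trainer.train_step(task, cpi, agents)
        print(f"step {step}: task={task} mean_return={stats}")
```

The design choice mirrored in this sketch is that parameters are shared within an agent class rather than across all agents, so heterogeneous units (e.g., tanks, ships, UAVs) are not forced onto a single policy while the number of trainable policies is still kept small.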

Key words: Multi-agent reinforcement learning, Combat simulation, Curriculum learning, Parameter sharing, Multi-agent environment design

CLC Number: TP183