Computer Science ›› 2024, Vol. 51 ›› Issue (1): 301-309. doi: 10.11896/jsjkx.230500146
罗睿卿1, 曾坤1, 张欣景2
LUO Ruiqing1, ZENG Kun1, ZHANG Xinjing2
Abstract: Modern battlefields are large and involve many unit types. Applying multi-agent reinforcement learning (MARL) to battlefield simulation can strengthen cooperative decision-making among combat units and thereby improve combat effectiveness. Current applications of MARL in wargaming research and adversarial exercises commonly make two simplifications: all agents are homogeneous, and combat units are densely distributed. Real combat scenarios do not always satisfy these assumptions; they may involve multiple heterogeneous agent types and sparsely distributed combat units. To explore reinforcement learning in a broader range of scenarios, this work improves on both aspects. First, a multi-scale multi-agent amphibious landing environment, M2ALE, is designed and implemented. M2ALE deliberately complicates the two simplified settings above, adding scenarios with multiple heterogeneous agents and sparsely distributed combat units. These two complications aggravate the exploration difficulty and the non-stationarity of the multi-agent environment, making it hard to train with common multi-agent algorithms. Second, a heterogeneous multi-agent curriculum learning framework, HMACL, is proposed to address the difficulties of the M2ALE environment. HMACL consists of three modules: 1) a source task generation (STG) module, which generates source tasks to guide agent training; 2) a class policy improvement (CPI) module, which, to counter the inherent non-stationarity of multi-agent systems, introduces a class-based parameter sharing strategy that realizes parameter sharing in heterogeneous agent systems; 3) a training module (Trainer), which fetches source tasks from STG and the latest policy from CPI, and trains the current policy with an arbitrary MARL algorithm. HMACL mitigates the exploration difficulty and non-stationarity problems that common MARL algorithms face in M2ALE, and guides the learning process of the multi-agent system in that environment. Experimental results show that HMACL substantially improves both the sampling efficiency and the final performance of MARL algorithms in the M2ALE environment.
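The class-based parameter sharing idea described for the CPI module can be illustrated with a minimal Python sketch: agents belonging to the same class share a single policy object (and therefore a single parameter set), while different classes keep independent parameters. The class names, agent IDs, and the dict-based policy placeholder below are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict


class ClassBasedPolicyRegistry:
    """Map each agent to a policy shared by all agents of its class.

    A sketch of class-based parameter sharing: agents of one class
    (e.g. all "infantry" units) hold the same policy instance, so an
    update through any one of them is visible to all of them, while
    heterogeneous classes keep separate parameters.
    """

    def __init__(self, policy_factory):
        self._factory = policy_factory
        self._policies = {}              # agent class -> shared policy
        self._members = defaultdict(list)

    def register(self, agent_id, agent_class):
        # Create the class policy lazily on first registration;
        # later agents of the same class reuse the SAME instance.
        if agent_class not in self._policies:
            self._policies[agent_class] = self._factory(agent_class)
        self._members[agent_class].append(agent_id)
        return self._policies[agent_class]


# Usage: two heterogeneous unit classes (names are hypothetical).
registry = ClassBasedPolicyRegistry(lambda cls: {"class": cls, "step": 0})
p_inf0 = registry.register("inf_0", "infantry")
p_inf1 = registry.register("inf_1", "infantry")
p_tank = registry.register("tank_0", "tank")

p_inf0["step"] += 1        # an update made through inf_0 ...
print(p_inf1["step"])      # ... is seen by inf_1 (shared parameters)
print(p_tank["step"])      # tank's parameters are untouched
```

In a real MARL setup the dict placeholder would be a neural network, and sharing the instance means gradients from every agent of a class flow into one parameter set, which is what stabilizes training against non-stationarity while still allowing heterogeneous classes to specialize.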