Computer Science ›› 2023, Vol. 50 ›› Issue (8): 202-208. doi: 10.11896/jsjkx.220500270
熊丽琴, 曹雷, 陈希亮, 赖俊
XIONG Liqin, CAO Lei, CHEN Xiliang, LAI Jun
Abstract: Value decomposition is a popular approach to cooperative multi-agent deep reinforcement learning. Its core idea is to represent the joint value function as some combination of individual value functions under the IGM (Individual-Global-Max) principle. In such methods, each agent selects actions only according to an individual value function conditioned on its local observation, which prevents agents from effectively exploiting global state information when learning their policies. Although many value decomposition algorithms employ attention mechanisms, hypernetworks and similar techniques to extract features of the global state and weight the individual value functions, thereby using global information indirectly to guide agent training, this use remains very limited: in complex environments, agents still struggle to learn effective policies and learn inefficiently. To improve agents' policy-learning ability, this paper proposes SE-VF (Value Factorization based on State Estimation), a value decomposition method for multi-agent deep reinforcement learning based on state estimation. SE-VF introduces a state estimation network that extracts features of the global state and outputs a state value measuring how good that state is; the resulting state loss is then included as one term of the overall loss function used to update the parameters of the agent networks, thereby improving the agents' policy selection. Experimental results show that SE-VF outperforms QMIX and other baselines in multiple scenarios of the StarCraft II micromanagement benchmark.
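The abstract states only that a state estimation network produces a value for the global state and that the resulting state loss is added to the training objective; the exact architecture and loss form are not given here. The following PyTorch-style sketch therefore illustrates one plausible reading under stated assumptions: the class StateEstimator, the function se_vf_loss and the weighting coefficient lambda_se are hypothetical names introduced for illustration, and the QMIX-style computation of the mixed joint value Q_tot and its TD target is assumed to happen elsewhere.

```python
import torch
import torch.nn as nn


class StateEstimator(nn.Module):
    """Hypothetical state-estimation network: maps the global state s to a scalar value V(s)."""

    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> V(s): (batch, 1)
        return self.net(state)


def se_vf_loss(q_tot: torch.Tensor,
               td_target: torch.Tensor,
               state_value: torch.Tensor,
               lambda_se: float = 0.5) -> torch.Tensor:
    """One plausible combined objective (an assumption, not the paper's exact loss):
    the usual TD loss on the mixed joint value Q_tot, plus a state loss that
    regresses V(s) toward the same (detached) TD target, weighted by lambda_se."""
    td_loss = ((q_tot - td_target.detach()) ** 2).mean()
    state_loss = ((state_value - td_target.detach()) ** 2).mean()
    return td_loss + lambda_se * state_loss
```

Under this reading, a training step would mix the individual agent utilities into Q_tot (as in QMIX), compute state_value from the global state, and minimize the combined objective over the agent, mixer and state-estimation parameters jointly; how the state loss actually propagates gradients into the agent networks depends on design details the abstract does not specify.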
CLC Number: