Computer Science ›› 2023, Vol. 50 ›› Issue (8): 202-208. doi: 10.11896/jsjkx.220500270

• Artificial Intelligence •


Value Factorization Method Based on State Estimation

XIONG Liqin, CAO Lei, CHEN Xiliang, LAI Jun   

  1. College of Command and Control Engineering, Army Engineering University, Nanjing 210007, China
  • Received: 2022-05-30 Revised: 2022-09-05 Online: 2023-08-15 Published: 2023-08-02
  • Corresponding author: CAO Lei (caolei_nj2022@126.com)
  • About author: XIONG Liqin (x18779557924@126.com), born in 1997, postgraduate. Her main research interests include multi-agent deep reinforcement learning and intelligent command and control.
    CAO Lei, born in 1965, Ph.D, professor, Ph.D supervisor. His main research interests include machine learning, command information systems and intelligent decision making.
  • Supported by:
    National Natural Science Foundation of China (61806221).



Abstract: Value factorization is a popular approach to cooperative multi-agent deep reinforcement learning, which factorizes the joint value function into individual value functions according to the IGM (Individual-Global-Max) principle. In this approach, agents select actions only according to individual value functions based on local observations, which prevents them from effectively using global state information to learn policies. Although many value factorization algorithms extract features of the global state to weight the individual value functions, using mechanisms such as attention and hypernetworks, and thus indirectly exploit global information to guide training, this utilization remains quite limited. In complex environments, agents still struggle to learn effective strategies and their learning efficiency is poor. To improve agents' policy learning ability, a value factorization method based on state estimation (SE-VF) is proposed, which introduces a state estimation network to extract features of the global state and produce a state value that evaluates how favorable the global state is, and then takes the resulting state loss as part of the loss function used to update the agent network parameters, thereby optimizing the agents' policy selection process. Experimental results show that SE-VF performs better than QMIX and other baselines in multiple scenarios of the StarCraft II micromanagement test platform.
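
Under the IGM principle, the joint greedy action coincides with the collection of per-agent greedy actions, which is what lets each agent act on its individual value function alone during execution. The sketch below illustrates, in PyTorch-style Python, how a state-estimation term can be added to a QMIX-style TD loss in the spirit the abstract describes. It is a minimal illustration under stated assumptions rather than the authors' implementation: the StateEstimator architecture, the choice of regressing the state value toward the bootstrapped TD target, and the weighting coefficient lambda_se are all assumptions, since the abstract does not specify them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StateEstimator(nn.Module):
    """Scores the global state with a scalar value V(s) (assumed MLP form)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        # state: (batch, state_dim) -> (batch, 1)
        return self.net(state)

def se_vf_loss(q_tot, v_state, td_target, lambda_se=1.0):
    # Standard TD regression on the mixed joint value Q_tot ...
    td_loss = F.mse_loss(q_tot, td_target)
    # ... plus a state-estimation term; its target and weight are assumptions here.
    se_loss = F.mse_loss(v_state, td_target)
    return td_loss + lambda_se * se_loss

# Toy usage with random tensors standing in for a QMIX-style pipeline.
batch, state_dim = 32, 48
estimator = StateEstimator(state_dim)
q_tot = torch.randn(batch, 1, requires_grad=True)  # stand-in for the mixed Q_tot
td_target = torch.randn(batch, 1)                  # stand-in for r + gamma * max Q_tot'
state = torch.randn(batch, state_dim)              # global state from the environment
loss = se_vf_loss(q_tot, estimator(state), td_target)
loss.backward()  # gradients flow to both the agent/mixer path and the state estimator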

Key words: State estimation, Value factorization, Multi-agent reinforcement learning, Deep reinforcement learning

CLC Number: 

  • TP181