Computer Science, 2022, Vol. 49, Issue (9): 172-182. doi: 10.11896/jsjkx.210800112

• Artificial Intelligence •

Overview of Multi-agent Deep Reinforcement Learning Based on Value Factorization

XIONG Li-qin, CAO Lei, LAI Jun, CHEN Xi-liang   

  1. College of Command and Control Engineering,Army Engineering University,Nanjing 210007,China
  • Received: 2021-08-12  Revised: 2021-12-29  Online: 2022-09-15  Published: 2022-09-09
  • Corresponding author: CAO Lei (caolei.nj@163.com)
  • About author: XIONG Li-qin (x18779557924@126.com), born in 1997, postgraduate. Her main research interests include multi-agent deep reinforcement learning and intelligent command and control.
    CAO Lei, born in 1965, Ph.D, professor, Ph.D supervisor. His main research interests include machine learning, command information systems and intelligent decision making.

Abstract: Multi-agent deep reinforcement learning based on value factorization is one family of multi-agent deep reinforcement learning algorithms and a research hotspot in this field. Under certain constraints, it factorizes the joint action-value function of a multi-agent system into a specific combination of individual action-value functions, which effectively alleviates the problems of environmental non-stationarity and the exponential explosion of the joint action space in multi-agent systems. Firstly, this paper explains why value function factorization is needed and introduces the basic theory of multi-agent deep reinforcement learning. Secondly, according to whether additional mechanisms are introduced and which mechanism is used, value-factorization-based multi-agent deep reinforcement learning (MADRL) algorithms are divided into three categories: simple factorization methods, methods based on the individual-global-max (IGM) principle, and methods based on the attention mechanism. Then, following this classification, several typical algorithms are introduced in detail and their strengths and weaknesses are compared and analyzed. Finally, the applications and development prospects of these algorithms are briefly described.
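The following illustrative note is not taken from the paper itself; it merely summarizes, in the conventional notation of this literature, the constraints that the best-known value-factorization methods place on the joint action-value function, where $Q_{tot}$ is the joint value, $Q_i$ the individual utilities, $\tau_i$ the local action-observation histories and $u_i$ the individual actions.

IGM (individual-global-max) condition:
\[ \arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau},\mathbf{u}) = \Big(\arg\max_{u_1} Q_1(\tau_1,u_1),\ \dots,\ \arg\max_{u_n} Q_n(\tau_n,u_n)\Big) \]

Simple additive factorization (as in VDN), a sufficient condition for IGM:
\[ Q_{tot}(\boldsymbol{\tau},\mathbf{u}) = \sum_{i=1}^{n} Q_i(\tau_i,u_i) \]

Monotonic mixing (as in QMIX), also sufficient for IGM:
\[ \frac{\partial Q_{tot}(\boldsymbol{\tau},\mathbf{u})}{\partial Q_i(\tau_i,u_i)} \ge 0,\quad \forall i \in \{1,\dots,n\} \]

A minimal Python sketch of the two simplest mixing rules follows. It is a toy illustration under the assumptions above: the per-agent Q values and the mixing weights are supplied directly here, whereas QMIX in fact produces its non-negative mixing weights with state-conditioned hypernetworks.

    import numpy as np

    def vdn_mix(individual_qs):
        # VDN-style factorization: Q_tot is simply the sum of per-agent utilities.
        return float(np.sum(individual_qs))

    def qmix_style_mix(individual_qs, weights, bias):
        # QMIX-style monotonic mixing: taking the absolute value of the weights
        # keeps dQ_tot/dQ_i >= 0, which is sufficient for the IGM condition.
        return float(np.abs(weights) @ individual_qs + bias)

    if __name__ == "__main__":
        qs = np.array([1.2, -0.3, 0.7])             # toy per-agent Q values
        print(vdn_mix(qs))                          # additive combination
        print(qmix_style_mix(qs, np.array([0.5, -1.0, 2.0]), 0.1))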

Key words: Value function factorization, Multi-agent deep reinforcement learning (MADRL), Attention mechanism, IGM principle

CLC number: TP181