Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 14-27. DOI: 10.3778/j.issn.1002-8331.2209-0186
ZHAO Liyang, CHANG Tianqing, CHU Kaixuan, GUO Libin, ZHANG Lei
Online: 2023-06-15
Published: 2023-06-15
Abstract: As an important branch of machine learning and artificial intelligence, fully cooperative multi-agent deep reinforcement learning combines, in a general way, the representation and decision-making capability of deep reinforcement learning with the distributed cooperation capability of multi-agent systems, providing an end-to-end solution to model-free sequential decision-making problems in fully cooperative multi-agent systems. This survey first explains the basic principles of deep reinforcement learning and summarizes the development of single-agent deep reinforcement learning along three main directions: value-function-based, policy-gradient-based, and actor-critic-based methods. It then analyzes the main challenges facing multi-agent deep reinforcement learning and the main training frameworks. According to how the maximum joint team reward is achieved, fully cooperative multi-agent deep reinforcement learning is divided into four categories: independent learning, communication learning, cooperative learning, and reward-function shaping; each category is summarized and analyzed. Finally, from the perspective of solving practical problems, future development directions for fully cooperative multi-agent deep reinforcement learning algorithms are discussed.
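To make the cooperative-learning category concrete, the sketch below illustrates one representative idea from that family: a VDN-style additive decomposition of the joint action value, trained end-to-end from a single team reward, in the spirit of Sunehag et al.'s value-decomposition networks. This is a minimal illustration rather than the implementation of any method covered by the survey; the network sizes, hyperparameters, and the random transition batch are assumptions made only for the example.

```python
# Minimal sketch (illustrative assumptions, not the survey's implementation) of
# VDN-style value decomposition: each agent has its own Q-network, the joint
# action value is the sum of per-agent values, and a TD loss on the shared
# team reward trains all agents end-to-end.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 8, 5   # toy sizes, chosen for the example

class AgentQNet(nn.Module):
    """Per-agent utility network Q_i(o_i, a_i)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))
    def forward(self, obs):              # obs: (batch, OBS_DIM)
        return self.net(obs)             # -> (batch, N_ACTIONS)

agents = nn.ModuleList(AgentQNet() for _ in range(N_AGENTS))
optimizer = torch.optim.Adam(agents.parameters(), lr=1e-3)
gamma = 0.99

# One fabricated transition batch: joint observations, joint actions,
# a single shared team reward, and next joint observations.
batch = 32
obs      = torch.randn(batch, N_AGENTS, OBS_DIM)
actions  = torch.randint(0, N_ACTIONS, (batch, N_AGENTS))
reward   = torch.randn(batch)
next_obs = torch.randn(batch, N_AGENTS, OBS_DIM)

# Joint value = sum of each agent's value for its chosen action (VDN additivity).
q_chosen = torch.stack(
    [agents[i](obs[:, i]).gather(1, actions[:, i:i+1]).squeeze(1)
     for i in range(N_AGENTS)], dim=1).sum(dim=1)

# Greedy bootstrap target, also summed across agents (target network omitted for brevity).
with torch.no_grad():
    q_next = torch.stack(
        [agents[i](next_obs[:, i]).max(dim=1).values
         for i in range(N_AGENTS)], dim=1).sum(dim=1)
    target = reward + gamma * q_next

loss = nn.functional.mse_loss(q_chosen, target)   # TD loss on the team reward
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Because the decentralized greedy actions of each agent jointly maximize the summed value, the learned policies can be executed without communication at test time, which is the property that motivates this class of value-decomposition methods.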
ZHAO Liyang, CHANG Tianqing, CHU Kaixuan, GUO Libin, ZHANG Lei. Survey of Fully Cooperative Multi-Agent Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2023, 59(12): 14-27.