Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 14-27. DOI: 10.3778/j.issn.1002-8331.2209-0186
ZHAO Liyang, CHANG Tianqing, CHU Kaixuan, GUO Libin, ZHANG Lei
Online: 2023-06-15
Published: 2023-06-15
Abstract: As an important branch of machine learning and artificial intelligence, fully cooperative multi-agent deep reinforcement learning combines, in a general way, the representation and decision-making capability of deep reinforcement learning with the distributed cooperation capability of multi-agent systems, providing an end-to-end solution to model-free sequential decision-making problems in fully cooperative multi-agent systems. This survey first explains the basic principles of deep reinforcement learning and summarizes the development of single-agent deep reinforcement learning along three main directions: value-function-based, policy-gradient-based, and actor-critic-based methods. It then analyzes the main challenges facing multi-agent deep reinforcement learning and the main training frameworks. According to how the maximum joint team reward is achieved, fully cooperative multi-agent deep reinforcement learning is divided into four categories: independent learning, communication learning, cooperative learning, and reward-function shaping; each category is summarized and analyzed. Finally, from the perspective of solving practical problems, future development directions for fully cooperative multi-agent deep reinforcement learning algorithms are discussed.
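To make the cooperative-learning category concrete, the sketch below illustrates one representative idea from that family: a VDN-style additive decomposition of the joint action value, trained end-to-end from a single team reward, in the spirit of Sunehag et al.'s value-decomposition networks. This is a minimal illustration rather than the implementation of any method covered by the survey; the network sizes, hyperparameters, and the random transition batch are assumptions made only for the example.

```python
# Minimal sketch (illustrative assumptions, not the survey's implementation) of
# VDN-style value decomposition: each agent has its own Q-network, the joint
# action value is the sum of per-agent values, and a TD loss on the shared
# team reward trains all agents end-to-end.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 8, 5   # toy sizes, chosen for the example

class AgentQNet(nn.Module):
    """Per-agent utility network Q_i(o_i, a_i)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))
    def forward(self, obs):              # obs: (batch, OBS_DIM)
        return self.net(obs)             # -> (batch, N_ACTIONS)

agents = nn.ModuleList(AgentQNet() for _ in range(N_AGENTS))
optimizer = torch.optim.Adam(agents.parameters(), lr=1e-3)
gamma = 0.99

# One fabricated transition batch: joint observations, joint actions,
# a single shared team reward, and next joint observations.
batch = 32
obs      = torch.randn(batch, N_AGENTS, OBS_DIM)
actions  = torch.randint(0, N_ACTIONS, (batch, N_AGENTS))
reward   = torch.randn(batch)
next_obs = torch.randn(batch, N_AGENTS, OBS_DIM)

# Joint value = sum of each agent's value for its chosen action (VDN additivity).
q_chosen = torch.stack(
    [agents[i](obs[:, i]).gather(1, actions[:, i:i+1]).squeeze(1)
     for i in range(N_AGENTS)], dim=1).sum(dim=1)

# Greedy bootstrap target, also summed across agents (target network omitted for brevity).
with torch.no_grad():
    q_next = torch.stack(
        [agents[i](next_obs[:, i]).max(dim=1).values
         for i in range(N_AGENTS)], dim=1).sum(dim=1)
    target = reward + gamma * q_next

loss = nn.functional.mse_loss(q_chosen, target)   # TD loss on the team reward
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Because the decentralized greedy actions of each agent jointly maximize the summed value, the learned policies can be executed without communication at test time, which is the property that motivates this class of value-decomposition methods.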
ZHAO Liyang, CHANG Tianqing, CHU Kaixuan, GUO Libin, ZHANG Lei. Survey of Fully Cooperative Multi-Agent Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2023, 59(12): 14-27.