Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 14-27. DOI: 10.3778/j.issn.1002-8331.2209-0186
• Research Hotspots and Reviews •
Survey of Fully Cooperative Multi-Agent Deep Reinforcement Learning
ZHAO Liyang, CHANG Tianqing, CHU Kaixuan, GUO Libin, ZHANG Lei
Online: 2023-06-15
Published: 2023-06-15
ZHAO Liyang, CHANG Tianqing, CHU Kaixuan, GUO Libin, ZHANG Lei. Survey of Fully Cooperative Multi-Agent Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2023, 59(12): 14-27.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2209-0186