Game Reinforcement Learning of Pure Strategy Nash Equilibrium

doi:10.3778/j.issn.1002-8331.2112-0167

Abstract

Abstract: Combining game theory with multi-agent reinforcement learning, known as game reinforcement learning, has attracted growing attention, but existing algorithms suffer from high computational complexity and cannot guarantee a pure-strategy Nash equilibrium. The Meta equilibrium Q-learning algorithm uses reaction functions to convert the original game into a meta-game, and the meta-equilibrium derived from the meta-game is a pure-strategy Nash equilibrium. While guaranteeing a pure-strategy Nash equilibrium, the algorithm ensures that each agent's return is no lower than a specific threshold. In addition, a fractal-based equilibrium-degree evaluation model computes the fractal dimension of any state to judge its stability and to measure the distance between that state and the equilibrium state; this model can be used to test the soundness and rationality of the meta-equilibrium. The conclusions about the algorithm and the model are verified in the welfare game and in a seize-and-control battle scenario.

Key words: pure-strategy Nash equilibrium, reinforcement learning, game theory, fractal
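
The conversion from an original game to a meta-game described in the abstract can be illustrated with a small sketch. It assumes a Howard-style first-level meta-game in which the column player chooses a reaction function (a map from each row action to a column action); the paper's exact construction may differ. The payoff values are textbook welfare-game numbers chosen for illustration, not taken from the paper, and the helper names `pure_nash` and `metagame` are hypothetical.

```python
from itertools import product

# Assumed welfare-game payoffs (textbook values, not taken from the paper).
# Row player: government (0 = aid, 1 = no aid).
# Column player: pauper (0 = work, 1 = loaf).
U1 = [[3, -1], [-1, 0]]   # government payoffs
U2 = [[2, 3], [1, 0]]     # pauper payoffs


def pure_nash(u1, u2):
    """Return all (row, col) cells where neither player gains by deviating alone."""
    rows, cols = len(u1), len(u1[0])
    equilibria = []
    for r, c in product(range(rows), range(cols)):
        row_ok = all(u1[r][c] >= u1[r2][c] for r2 in range(rows))
        col_ok = all(u2[r][c] >= u2[r][c2] for c2 in range(cols))
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


def metagame(u1, u2):
    """Build a first-level meta-game: the column player picks a reaction function."""
    rows, cols = len(u1), len(u1[0])
    # Each reaction function assigns a column action to every row action.
    reactions = list(product(range(cols), repeat=rows))
    m1 = [[u1[r][f[r]] for f in reactions] for r in range(rows)]
    m2 = [[u2[r][f[r]] for f in reactions] for r in range(rows)]
    return m1, m2, reactions


original = pure_nash(U1, U2)
m1, m2, reactions = metagame(U1, U2)
meta = pure_nash(m1, m2)
# Map each meta-equilibrium back to the action pair it induces in the original game.
outcomes = sorted({(r, reactions[k][r]) for r, k in meta})
print("original pure NE:", original)
print("meta-equilibrium outcomes:", outcomes)
```

With these assumed payoffs, the original game has no pure-strategy Nash equilibrium, while the meta-game has two, which induce the outcomes (aid, loaf) and (no aid, work). This only illustrates why reaction functions can restore pure-strategy equilibria; it does not reproduce the Q-learning procedure or the threshold guarantee.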

WANG Jun, CAO Lei, CHEN Xiliang, CHEN Ying, ZHAO Zhiruo. Game Reinforcement Learning of Pure Strategy Nash Equilibrium[J]. Computer Engineering and Applications, 2022, 58(15): 78-86.
