Computer Engineering and Applications, 2022, Vol. 58, Issue (15): 78-86. DOI: 10.3778/j.issn.1002-8331.2112-0167

• Theory, Research and Development •

Game Reinforcement Learning of Pure Strategy Nash Equilibrium

WANG Jun, CAO Lei, CHEN Xiliang, CHEN Ying, ZHAO Zhiruo   

  1. College of Command Information System, Army Engineering University, Nanjing 210007, China
  2. Postdoctoral Research Workstation of Eastern Theater General Hospital, Nanjing 210002, China
  • Online: 2022-08-01 Published: 2022-08-01

纯策略纳什均衡的博弈强化学习

王军,曹雷,陈希亮,陈英,赵芷若   

  1. 陆军工程大学 指挥控制工程学院,南京 210007
  2. 东部战区总医院 博士后科研工作站,南京 210002

Abstract: Game reinforcement learning, which combines game theory with multi-agent reinforcement learning, has gradually attracted attention, but existing algorithms suffer from high computational complexity and cannot guarantee a pure-strategy Nash equilibrium. The meta-equilibrium Q-learning algorithm converts the original game into a meta-game through reaction functions, and the meta-equilibrium derived from the meta-game is a pure-strategy Nash equilibrium. While guaranteeing a pure-strategy Nash equilibrium, the algorithm ensures that the reward of each agent is no lower than a certain threshold. In addition, a fractal-based equilibrium-degree evaluation model judges the stability of any state by calculating its fractal dimension and evaluates the distance between that state and the equilibrium state. This model can be used to test the soundness and rationality of the meta-equilibrium. These conclusions about the algorithm and the model are verified in the welfare game and the control war.

Key words: pure-strategy Nash equilibrium, reinforcement learning, game theory, fractal

摘要: 将博弈理论与多智能体强化学习结合形成博弈强化学习逐渐受到关注,但是也存在算法的计算复杂度高和无法保证纯策略纳什均衡的问题。Meta equilibrium Q-learning算法通过反应函数将原始博弈转换为元博弈,而元博弈推导出的元均衡是纯策略纳什均衡。该算法在保证纯策略纳什均衡的前提下能够使得每个智能体的回报不低于某特定阈值。同时,基于分形的均衡程度评估模型能够通过计算任意状态的分形维数来判断其稳态,并评估任意状态与均衡状态之间的距离,该模型可以检验元均衡的科学性与合理性,上述算法和模型的相关结论在福利博弈和夺控战中都得到具体验证。

关键词: 纯策略纳什均衡, 强化学习, 博弈论, 分形
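
The abstract's central idea, that a game without a pure-strategy Nash equilibrium can gain one once a player chooses a reaction function instead of a plain action, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the payoff values are a textbook-style welfare game (government versus pauper), not figures from the paper, and the construction is a simple first-level meta-game in which only the second player uses reaction functions, which may differ from the meta-game the authors actually build. The names `pure_nash`, `base_payoff`, and `meta_payoff` are hypothetical helpers.

```python
from itertools import product

# Illustrative welfare-game payoffs (assumed values, not taken from the paper).
# Keys are (government action, pauper action); values are (u_government, u_pauper).
ACTIONS_1 = ["aid", "no_aid"]
ACTIONS_2 = ["work", "loaf"]
PAYOFF = {
    ("aid", "work"): (3, 2),
    ("aid", "loaf"): (-1, 3),
    ("no_aid", "work"): (-1, 1),
    ("no_aid", "loaf"): (0, 0),
}


def pure_nash(strats_1, strats_2, payoff):
    """Return every pure profile from which no player gains by deviating alone."""
    equilibria = []
    for s1, s2 in product(strats_1, strats_2):
        u1, u2 = payoff(s1, s2)
        best_1 = all(payoff(d, s2)[0] <= u1 for d in strats_1)
        best_2 = all(payoff(s1, d)[1] <= u2 for d in strats_2)
        if best_1 and best_2:
            equilibria.append((s1, s2))
    return equilibria


def base_payoff(a1, a2):
    """Payoffs of the original game."""
    return PAYOFF[(a1, a2)]


# A reaction function for player 2 is a tuple holding one response per
# player-1 action, in the order of ACTIONS_1: (response to aid, response to no_aid).
REACTIONS = list(product(ACTIONS_2, repeat=len(ACTIONS_1)))


def meta_payoff(a1, reaction):
    """Payoffs of the meta-game: player 2's action is read off its reaction function."""
    a2 = reaction[ACTIONS_1.index(a1)]
    return PAYOFF[(a1, a2)]


base_eq = pure_nash(ACTIONS_1, ACTIONS_2, base_payoff)
meta_eq = pure_nash(ACTIONS_1, REACTIONS, meta_payoff)
# Action pairs of the original game that the meta-equilibria induce.
meta_outcomes = sorted({(a1, r[ACTIONS_1.index(a1)]) for a1, r in meta_eq})

print("pure equilibria of the original game:", base_eq)
print("pure equilibria of the meta-game:", meta_eq)
print("induced action pairs:", meta_outcomes)
```

With these assumed payoffs the original game has no pure-strategy equilibrium, while the meta-game has two, both using the pauper's best-response function (loaf if aided, work otherwise). These profiles are equilibria of the meta-game; the action pairs they induce are not equilibria of the original game. The sketch does not cover the Q-learning updates, the reward threshold, or the fractal-based evaluation model.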