Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (9): 19-29. DOI: 10.3778/j.issn.1002-8331.2202-0297
Survey of Opponent Modeling Methods and Applications in Intelligent Game Confrontation
WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng
Online: 2022-05-01
Published: 2022-05-01
Abstract: Intelligent game confrontation has long been a research hotspot in artificial intelligence. In an adversarial game environment, modeling the opponent makes it possible to infer attributes of hostile agents such as their actions, goals, and strategies, providing key information for formulating game strategies. Opponent modeling methods have broad application prospects in fields such as competitive games and combat simulation; because formulating a game strategy must be premised on the action strategies of all players, building an accurate model of the opponent's behavior is particularly important for predicting its intentions. This survey explains the necessity of opponent modeling from three aspects, namely connotation, methods, and applications, and classifies existing modeling approaches; it reviews and summarizes prediction methods based on reinforcement learning, reasoning methods based on theory of mind, and Bayesian-based optimization methods; taking sequential games (Texas Hold'em), real-time strategy games (StarCraft), and meta-games as typical application scenarios, it analyzes the role of opponent modeling in intelligent game confrontation; finally, it discusses the outlook for opponent modeling technology from three aspects: bounded rationality, strategy deception, and interpretability.
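To make the Bayesian-based category concrete, below is a minimal sketch (an illustration, not taken from the surveyed papers) of the belief-update idea commonly used in Bayesian opponent modeling: a small set of candidate opponent types is maintained, and the probability of each type is updated with Bayes' rule after every observed opponent action. All names here (OPPONENT_TYPES, update_belief, the action indices) are hypothetical choices for illustration.

import numpy as np

# Minimal illustrative sketch of Bayesian opponent-type inference.
# Each candidate type is a fixed policy: P(action | type) over three actions.
OPPONENT_TYPES = {
    "aggressive": np.array([0.7, 0.2, 0.1]),
    "defensive":  np.array([0.1, 0.3, 0.6]),
    "balanced":   np.array([0.34, 0.33, 0.33]),
}

def update_belief(belief, observed_action):
    """One Bayes step: P(type | action) is proportional to P(action | type) * P(type)."""
    posterior = {
        t: belief[t] * policy[observed_action]
        for t, policy in OPPONENT_TYPES.items()
    }
    norm = sum(posterior.values())
    return {t: p / norm for t, p in posterior.items()}

# Start from a uniform prior; the belief sharpens as actions are observed.
belief = {t: 1.0 / len(OPPONENT_TYPES) for t in OPPONENT_TYPES}
for action in (0, 0, 1, 0):  # the opponent mostly plays action 0
    belief = update_belief(belief, action)

print(belief)  # probability mass concentrates on the "aggressive" type

In practice, the exploiting agent would then select a response against the most probable type, or best-respond to the full posterior mixture; the sketch only shows the inference step shared by these approaches.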
WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng. Survey of Opponent Modeling Methods and Applications in Intelligent Game Confrontation[J]. Computer Engineering and Applications, 2022, 58(9): 19-29.