Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (9): 19-29. DOI: 10.3778/j.issn.1002-8331.2202-0297
Survey of Opponent Modeling Methods and Applications in Intelligent Game Confrontation
WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng
Online: 2022-05-01
Published: 2022-05-01
Abstract: Intelligent game confrontation has long been a research hotspot in artificial intelligence. In an adversarial game environment, modeling the opponent makes it possible to infer attributes of hostile agents such as their actions, goals, and strategies, providing key information for formulating game strategies. Opponent modeling methods have broad application prospects in fields such as competitive games and combat simulation; because formulating a game strategy must be premised on the action strategies of all players, building an accurate model of the opponent's behavior is particularly important for predicting its intentions. This survey explains the necessity of opponent modeling from three aspects, namely connotation, methods, and applications, and classifies existing modeling approaches; it reviews and summarizes prediction methods based on reinforcement learning, reasoning methods based on theory of mind, and Bayesian-based optimization methods; taking sequential games (Texas Hold'em), real-time strategy games (StarCraft), and meta-games as typical application scenarios, it analyzes the role of opponent modeling in intelligent game confrontation; finally, it discusses the outlook for opponent modeling technology from three aspects: bounded rationality, strategy deception, and interpretability.
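To make the Bayesian-based category concrete, below is a minimal sketch (an illustration, not taken from the surveyed papers) of the belief-update idea commonly used in Bayesian opponent modeling: a small set of candidate opponent types is maintained, and the probability of each type is updated with Bayes' rule after every observed opponent action. All names here (OPPONENT_TYPES, update_belief, the action indices) are hypothetical choices for illustration.

import numpy as np

# Minimal illustrative sketch of Bayesian opponent-type inference.
# Each candidate type is a fixed policy: P(action | type) over three actions.
OPPONENT_TYPES = {
    "aggressive": np.array([0.7, 0.2, 0.1]),
    "defensive":  np.array([0.1, 0.3, 0.6]),
    "balanced":   np.array([0.34, 0.33, 0.33]),
}

def update_belief(belief, observed_action):
    """One Bayes step: P(type | action) is proportional to P(action | type) * P(type)."""
    posterior = {
        t: belief[t] * policy[observed_action]
        for t, policy in OPPONENT_TYPES.items()
    }
    norm = sum(posterior.values())
    return {t: p / norm for t, p in posterior.items()}

# Start from a uniform prior; the belief sharpens as actions are observed.
belief = {t: 1.0 / len(OPPONENT_TYPES) for t in OPPONENT_TYPES}
for action in (0, 0, 1, 0):  # the opponent mostly plays action 0
    belief = update_belief(belief, action)

print(belief)  # probability mass concentrates on the "aggressive" type

In practice, the exploiting agent would then select a response against the most probable type, or best-respond to the full posterior mixture; the sketch only shows the inference step shared by these approaches.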
WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng. Survey of Opponent Modeling Methods and Applications in Intelligent Game Confrontation[J]. Computer Engineering and Applications, 2022, 58(9): 19-29.