Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning

Julien Perolat, Bart de Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, Shayegan Omidshafiei, Edward Lockhart, Laurent Sifre, Nathalie Beauguerlange, Remi Munos, David Silver, Satinder Singh, Demis Hassabis, Karl Tuyls

We introduce DeepNash, an autonomous agent capable of learning to play the imperfect information game Stratego from scratch, up to a human expert level. Stratego is one of the few iconic board games that Artificial Intelligence (AI) has not yet mastered. This popular game has an enormous game tree on the order of $10^{535}$ nodes, i.e., $10^{175}$ times larger than that of Go. It has the additional complexity of requiring decision-making under imperfect information, similar to Texas hold'em poker, which has a significantly smaller game tree (on the order of $10^{164}$ nodes). Decisions in Stratego are made over a large number of discrete actions with no obvious link between action and outcome. Episodes are long, often with hundreds of moves before a player wins, and situations in Stratego cannot easily be broken down into manageably sized sub-problems as in poker. For these reasons, Stratego has been a grand challenge for the field of AI for decades, and existing AI methods barely reach an amateur level of play. DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego via self-play. The Regularised Nash Dynamics (R-NaD) algorithm, a key component of DeepNash, converges to an approximate Nash equilibrium, instead of 'cycling' around it, by directly modifying the underlying multi-agent learning dynamics. DeepNash beats existing state-of-the-art AI methods in Stratego and has achieved a yearly (2022) and all-time top-3 rank on the Gravon games platform, competing with human expert players.
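
To make the contrast between 'cycling' and converging more concrete, the toy sketch below compares plain multiplicative-weights self-play with a regularised variant on a small two-player zero-sum matrix game. It is not the DeepNash implementation and the abstract does not specify these details: the payoff matrix `A`, the step size `lr`, the regularisation strength `eta`, the function names, and the specific form of the penalty (a log-ratio term pulling each policy toward a reference policy that is periodically reset to the current policy) are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 2x2 zero-sum game (not from the paper): payoffs of the row player.
A = np.array([[2.0, -1.0], [-1.0, 1.0]])
# Its unique mixed equilibrium: both players play their first action with prob. 0.4.
NASH = np.array([0.4, 0.6])


def log_softmax(z):
    """Numerically stable log of the softmax of the logits z."""
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())


def self_play(eta, outer_iters=30, inner_steps=500, lr=0.05):
    """Exponentiated-gradient self-play on logits.

    eta = 0 gives plain multiplicative weights. eta > 0 adds an assumed
    penalty -eta * log(pi / pi_reg) to each player's payoff, and pi_reg is
    reset to the current policy after every block of inner steps.
    """
    zx = np.log(np.array([0.9, 0.1]))  # row player's logits, far from equilibrium
    zy = np.log(np.array([0.2, 0.8]))  # column player's logits
    reg_x, reg_y = log_softmax(zx), log_softmax(zy)
    for _ in range(outer_iters):
        for _ in range(inner_steps):
            lx, ly = log_softmax(zx), log_softmax(zy)
            x, y = np.exp(lx), np.exp(ly)
            ux = A @ y - eta * (lx - reg_x)      # regularised payoff, row player
            uy = -A.T @ x - eta * (ly - reg_y)   # regularised payoff, column player
            zx, zy = zx + lr * ux, zy + lr * uy  # simultaneous update
        reg_x, reg_y = log_softmax(zx), log_softmax(zy)  # move the reference policy
    return np.exp(log_softmax(zx)), np.exp(log_softmax(zy))


def gap_to_nash(x, y):
    """Largest absolute deviation of either policy from the equilibrium."""
    return max(np.abs(x - NASH).max(), np.abs(y - NASH).max())


if __name__ == "__main__":
    print("plain dynamics, gap to equilibrium:", gap_to_nash(*self_play(eta=0.0)))
    print("regularised dynamics, gap to equilibrium:", gap_to_nash(*self_play(eta=1.0)))
```

In this sketch the penalty turns each inner phase into a game with a single attracting fixed point, and resetting the reference policy moves that fixed point step by step toward the equilibrium of the original game. The unregularised run has no such attractor, so its iterates keep orbiting the equilibrium without settling on it.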
