We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIGs under the perfect-recall assumption, where the only feedback is realizations of the game (bandit feedback). In particular, the dynamics of the IIG are not known -- we can only access them by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order $1/\sqrt{T}$, where $T$ is the number of played games. Moreover, IXOMD is computationally efficient as it only needs to perform updates along the sampled trajectory.
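To make the two ingredients in the algorithm's name concrete, the sketch below illustrates an implicit-exploration (IX) loss estimate built from bandit feedback together with an entropic online mirror descent (OMD) step, on a single decision point rather than the full game tree. The step size `eta`, the IX parameter `gamma`, the loss range, and the toy loss vector are illustrative assumptions, not the paper's tuning; the full IXOMD algorithm applies such updates only at the information sets visited along the sampled trajectory.

```python
import numpy as np

# Illustrative sketch (assumptions noted above): IX loss estimates under
# bandit feedback combined with an OMD step using the entropic regularizer,
# i.e., a multiplicative-weights update, on a single decision point.

def ix_loss_estimate(policy, sampled_action, observed_loss, gamma):
    """Importance-weighted loss estimate with implicit exploration.

    Only the sampled action receives a non-zero estimate, which is why the
    update touches nothing outside the sampled trajectory.
    """
    loss_hat = np.zeros_like(policy)
    loss_hat[sampled_action] = observed_loss / (policy[sampled_action] + gamma)
    return loss_hat


def omd_step(policy, loss_hat, eta):
    """One OMD step with the entropic regularizer (exponential weights)."""
    new_policy = policy * np.exp(-eta * loss_hat)
    return new_policy / new_policy.sum()


# Toy usage: T repeated games against fixed (unknown) expected losses.
rng = np.random.default_rng(0)
n_actions, T = 3, 5000
true_losses = np.array([0.9, 0.2, 0.6])            # hypothetical losses in [0, 1]
eta = gamma = np.sqrt(np.log(n_actions) / T)        # assumed 1/sqrt(T) scaling

policy = np.full(n_actions, 1.0 / n_actions)
for _ in range(T):
    a = rng.choice(n_actions, p=policy)              # sample the trajectory
    loss = float(np.clip(true_losses[a] + 0.05 * rng.standard_normal(), 0.0, 1.0))
    policy = omd_step(policy, ix_loss_estimate(policy, a, loss, gamma), eta)

print(policy)  # probability mass should concentrate on the low-loss action
```

The IX term `gamma` in the denominator keeps the loss estimates bounded, which is what enables the high-probability (rather than expected) guarantee; setting `gamma = 0` recovers the standard unbiased importance-weighted estimator.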