Counterfactual Regret Minimization (CFR) has found success in settings like poker that have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last-iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn, for which no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.
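To make the abstract's description concrete, the following is a minimal, illustrative sketch of what a LONR-style update could look like on a known tabular MDP. It assumes regret matching as the per-state no-regret learner and synchronous, value-iteration-style sweeps; the function names (`lonr_value_iteration`, `regret_matching_policy`) and the exact bookkeeping are our own illustration under these assumptions, not the paper's reference implementation.

```python
import numpy as np

def regret_matching_policy(regrets):
    """Policy proportional to positive cumulative regrets; uniform if none are positive."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total
    return np.ones_like(regrets) / len(regrets)

def lonr_value_iteration(P, R, gamma=0.9, iterations=1000):
    """Hypothetical sketch of a LONR-style update on a known tabular MDP.

    P: transition probabilities, shape (S, A, S)
    R: expected immediate rewards, shape (S, A)
    Each state runs its own regret-matching learner over its actions; Q-values
    are updated with a Q-learning-like backup that uses the value of the
    current local policy (rather than a max) at successor states.
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    regrets = np.zeros((S, A))

    for _ in range(iterations):
        # Current policy at every state from the local regret-matching learners.
        pi = np.array([regret_matching_policy(regrets[s]) for s in range(S)])
        # State values under the current local policies.
        V = (pi * Q).sum(axis=1)          # shape (S,)
        # Q-learning-like backup: expected discounted next-state value under pi.
        Q = R + gamma * (P @ V)           # shape (S, A)
        # Accumulate per-state regrets of each action against the policy's value.
        V_new = (pi * Q).sum(axis=1)
        regrets += Q - V_new[:, None]

    # Return the last-iterate policy and the final Q-values.
    final_pi = np.array([regret_matching_policy(regrets[s]) for s in range(S)])
    return final_pi, Q
```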