This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in its algorithmic simplicity, which reveals that neither variance reduction nor sample splitting is necessary to achieve sample optimality.
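To make the algorithmic template concrete, the following is a minimal sketch of a Bernstein-style pessimistic value-iteration update for the max-player; the notation here (the empirical transition kernel $\widehat{P}$, the visitation counts $N(s,a,b)$, the confidence level $\delta$, and the constant $c_{\mathrm{b}}$) is illustrative and not taken verbatim from the paper:
\[
\underline{Q}(s,a,b) \,\leftarrow\, \max\Big\{\, r(s,a,b) \,+\, \gamma\,\widehat{P}_{s,a,b}\,\underline{V} \,-\, b\big(s,a,b;\underline{V}\big),\ 0 \,\Big\},
\]
with a Bernstein-style penalty of the form
\[
b(s,a,b;V) \,=\, c_{\mathrm{b}}\sqrt{\frac{\log\frac{SABN}{\delta}}{N(s,a,b)}\,\mathsf{Var}_{\widehat{P}_{s,a,b}}(V)} \;+\; c_{\mathrm{b}}\,\frac{\log\frac{SABN}{\delta}}{(1-\gamma)\,N(s,a,b)},
\]
where $\underline{V}(s)$ is the Nash value of the $A\times B$ matrix game with payoff matrix $\underline{Q}(s,\cdot,\cdot)$. Subtracting the penalty (rather than adding it) enforces pessimism in the face of limited data coverage, which is the mechanism by which the guarantee can depend on the offline dataset only through the clipped concentrability coefficient $C_{\mathsf{clipped}}^{\star}$ instead of requiring uniform coverage.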