We consider learning Nash equilibria in two-player zero-sum Markov games with nonlinear function approximation, where the action-value function is approximated by a function in a Reproducing Kernel Hilbert Space (RKHS). The key challenge is how to perform exploration in the high-dimensional function space. We propose a novel online learning algorithm that finds a Nash equilibrium by minimizing the duality gap. At the core of our algorithm are upper and lower confidence bounds derived from the principle of optimism in the face of uncertainty. We prove that our algorithm attains an $O(\sqrt{T})$ regret with polynomial computational complexity, under very mild assumptions on the reward function and the underlying dynamics of the Markov game. We also propose several extensions of our algorithm, including an algorithm with a Bernstein-type bonus that achieves a tighter regret bound, and another algorithm for model misspecification that can be applied to neural function approximation.
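For concreteness, the following is a minimal sketch of the two central quantities above, written in commonly used notation that may differ from the paper's exact definitions. Writing $V_1^{\mu,\nu}(s_1)$ for the value of the game at the initial state $s_1$ when the max-player follows policy $\mu$ and the min-player follows policy $\nu$, the duality gap of the policy pair $(\mu^t,\nu^t)$ produced at episode $t$, and the resulting $T$-episode regret, take the form
\[
\mathrm{Gap}(\mu^t,\nu^t) \;=\; \max_{\mu} V_1^{\mu,\nu^t}(s_1) \;-\; \min_{\nu} V_1^{\mu^t,\nu}(s_1),
\qquad
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \mathrm{Gap}(\mu^t,\nu^t).
\]
In kernel settings, optimistic and pessimistic estimates are typically formed by adding to, respectively subtracting from, a kernel ridge regression estimate of the action-value function a bonus of the usual form
\[
b^t(z) \;=\; \beta \sqrt{\,k(z,z) \;-\; k^t(z)^{\top}\big(K^t + \lambda I\big)^{-1} k^t(z)\,},
\qquad z = (s,a,b),
\]
where $K^t$ is the Gram matrix over previously visited state-action pairs, $k^t(z)$ is the corresponding kernel vector, and $\beta,\lambda>0$ are confidence and regularization parameters; the algorithm's actual bonus (e.g., the Bernstein-type variant) may be defined differently.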