Policy gradient methods are widely used for solving two-player zero-sum games and achieve superhuman performance in practice. However, it remains elusive when these methods can provably find a near-optimal solution, and how many samples and iterations are needed to do so. The current paper studies natural extensions of the Natural Policy Gradient algorithm for solving two-player zero-sum games, where function approximation is used to generalize across states. We thoroughly characterize the algorithms' performance in terms of the number of samples, the number of iterations, concentrability coefficients, and the approximation error. To our knowledge, this is the first quantitative analysis of policy gradient methods with function approximation for two-player zero-sum Markov games.