We study the global convergence of policy optimization for finding the Nash equilibria (NE) of zero-sum linear quadratic (LQ) games. To this end, we first investigate the landscape of LQ games, viewing them as nonconvex-nonconcave saddle-point problems in the policy space. Specifically, we show that despite the nonconvexity and nonconcavity of the objective, zero-sum LQ games have the property that the stationary point of the objective function with respect to the linear feedback control policies constitutes the NE of the game. Building upon this, we develop three projected nested-gradient methods that are guaranteed to converge to the NE of the game. Moreover, we show that all of these algorithms enjoy both globally sublinear and locally linear convergence rates. Simulation results are also provided to illustrate the satisfactory convergence properties of the algorithms. To the best of our knowledge, this is the first work to investigate the optimization landscape of LQ games and to provably establish the convergence of policy optimization methods to the NE. Our work serves as an initial step toward understanding the theoretical aspects of policy-based reinforcement learning algorithms for zero-sum Markov games in general.
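For concreteness, a standard policy-space formulation of a zero-sum LQ game is sketched below; the symbols $A, B, C$ (dynamics), $Q, R^u, R^w$ (costs), and the feedback gains $K, L$ are the usual LQ-game notation and are meant only as an illustrative sketch of the setting, not necessarily the exact model of this paper:

$$
\min_{K}\ \max_{L}\ \ \mathcal{C}(K,L) \;=\; \mathbb{E}_{x_0 \sim \mathcal{D}}\!\left[ \sum_{t=0}^{\infty} \Big( x_t^\top Q x_t + u_t^\top R^u u_t - w_t^\top R^w w_t \Big) \right],
\quad \text{s.t. } x_{t+1} = A x_t + B u_t + C w_t,\ \ u_t = -K x_t,\ \ w_t = -L x_t,
$$

where the minimizing player chooses the gain $K$ and the maximizing player chooses the gain $L$. Under this parameterization, an NE $(K^*, L^*)$ is a saddle point of $\mathcal{C}$ over the stabilizing gains, i.e., $\mathcal{C}(K^*, L) \le \mathcal{C}(K^*, L^*) \le \mathcal{C}(K, L^*)$ for all admissible $K, L$; the landscape result above states that a stationary point of $\mathcal{C}$ is such a saddle point even though $\mathcal{C}$ is nonconvex in $K$ and nonconcave in $L$.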