We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation.
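The paper's setting is an episodic two-player zero-sum stochastic game with independent, uncoordinated players. As a rough illustration of the two-timescale idea only, the following is a minimal sketch on a one-state (matrix) zero-sum game with exact gradients; the payoff matrix, step sizes, and projection helper are illustrative assumptions, not the paper's algorithm or parameters.

```python
# Hypothetical sketch: independent projected policy gradient with two-timescale
# step sizes on a zero-sum matrix game (a one-state stand-in for a stochastic game).
# Not the paper's algorithm; payoffs and learning rates are arbitrary choices.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # zero-sum payoff: x maximizes x^T A y, y minimizes it
x = np.ones(3) / 3                # max-player's mixed strategy (policy)
y = np.ones(3) / 3                # min-player's mixed strategy (policy)

eta_x = 0.01                      # slow timescale for the max-player
eta_y = 0.1                       # fast timescale for the min-player (eta_y >> eta_x)

T = 20000
x_avg = np.zeros(3)
y_avg = np.zeros(3)
for t in range(T):
    # Each player updates using only its own payoff gradient -- no coordination.
    grad_x = A @ y                # gradient of x^T A y with respect to x
    grad_y = A.T @ x              # gradient of x^T A y with respect to y
    x = project_simplex(x + eta_x * grad_x)   # gradient ascent for the max-player
    y = project_simplex(y - eta_y * grad_y)   # gradient descent for the min-player
    x_avg += x
    y_avg += y

# Evaluate the time-averaged strategies: simultaneous gradient play need not
# converge in the last iterate, but averaged no-regret play approaches equilibrium.
x_avg /= T
y_avg /= T
gap = np.max(A @ y_avg) - np.min(A.T @ x_avg)   # duality gap; zero at a min-max equilibrium
print(f"approximate duality gap of averaged play: {gap:.4f}")
```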