We prove, under commonly used assumptions, the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function (the actor) and a value function (the critic). Both functions can be deep neural networks of arbitrary complexity. Our framework allows showing convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from two time-scale stochastic approximation theory. Our results are valid for actor-critic methods that use episodic samples and whose policy becomes more greedy during learning. Previous convergence proofs assume linear function approximation, cannot treat episodic samples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic.
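As a brief, illustrative sketch (the notation below is the generic textbook form of a two time-scale scheme, not necessarily the exact iterates analyzed in the paper), two time-scale stochastic approximation couples a fast critic update with a slow actor update:
% Illustrative two time-scale iterates (assumed generic notation):
% \omega_n denotes critic (value) parameters on the fast time scale a(n);
% \theta_n denotes actor (policy) parameters on the slow time scale b(n).
\begin{align*}
  \omega_{n+1} &= \omega_n + a(n)\,\bigl[ h(\omega_n, \theta_n) + M^{(1)}_{n+1} \bigr], \\
  \theta_{n+1} &= \theta_n + b(n)\,\bigl[ g(\omega_n, \theta_n) + M^{(2)}_{n+1} \bigr],
\end{align*}
where $M^{(1)}_{n+1}$ and $M^{(2)}_{n+1}$ are martingale difference noise terms and the step sizes satisfy $\sum_n a(n) = \sum_n b(n) = \infty$, $\sum_n \bigl(a(n)^2 + b(n)^2\bigr) < \infty$, and $b(n)/a(n) \to 0$, so that the critic effectively equilibrates between actor updates.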