We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the central ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning. With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.
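To make the two-timescale structure concrete, here is a rough sketch of a GEM-style emphasis critic with linear function approximation. Everything below is an illustrative assumption, not the paper's algorithm or experimental setup: the MDP, features, importance-sampling ratios, and step sizes are made up, and the update simply mirrors the GTD2 template on the emphasis fixed point (emphasis propagates forward in time, so the bootstrap target uses the previous state's estimate), with the auxiliary weights on the faster timescale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-state MDP; features, behavior dynamics, and per-state
# importance ratios are all illustrative assumptions.
n_states, n_feat = 5, 3
X = rng.normal(size=(n_states, n_feat))              # linear features x(s)
P = rng.dirichlet(np.ones(n_states), size=n_states)  # behavior transitions
rho = rng.uniform(0.7, 1.3, size=n_states)           # assumed IS ratios pi/mu
interest, gamma = 1.0, 0.9

w = np.zeros(n_feat)   # emphasis weights: x(s)ᵀw ≈ m(s)
u = np.zeros(n_feat)   # auxiliary weights (fast timescale), as in GTD2
alpha, beta = 1e-2, 1e-1  # slow / fast step sizes

s = 0
for t in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    # "Reversed" TD error: the target bootstraps from the *previous*
    # state's emphasis estimate, discounted and reweighted by rho.
    delta = interest + gamma * rho[s] * (X[s] @ w) - X[s_next] @ w
    # Fast timescale: track the projected TD error with u.
    u += beta * (delta - X[s_next] @ u) * X[s_next]
    # Slow timescale: gradient-correction update for w.
    w += alpha * (X[s_next] - gamma * rho[s] * X[s]) * (X[s_next] @ u)
    s = s_next
```

The two step sizes are what makes this "two-timescale": `u` chases a moving target defined by the current `w`, while `w` moves slowly enough that `u` effectively tracks it, which is the structure the convergence analysis relies on.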