The oscillating performance of off-policy learning and the persistent errors in the actor-critic (AC) setting call for algorithms that learn conservatively and thus better suit stability-critical applications. In this paper, we propose a novel off-policy AC algorithm, cautious actor-critic (CAC). The name cautious reflects its doubly conservative nature: we exploit the classic policy interpolation from conservative policy iteration for the actor and the entropy regularization of conservative value iteration for the critic. Our key observation is that the entropy-regularized critic facilitates and simplifies the otherwise unwieldy interpolated actor update while still ensuring robust policy improvement. We compare CAC to state-of-the-art AC methods on a set of challenging continuous control problems and demonstrate that CAC achieves comparable performance while significantly stabilizing learning.
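For context, a minimal sketch of the two conservative ingredients the abstract refers to, written in standard notation rather than the paper's own (the symbols $\zeta$ and $\tau$ are generic placeholders assumed here): the conservative-policy-iteration style interpolated actor update and an entropy-regularized (soft) critic backup.

$$\pi_{k+1} = (1-\zeta)\,\pi_k + \zeta\,\pi'_k, \qquad \zeta \in [0,1],$$

where $\pi'_k$ denotes the improved (greedy) policy at iteration $k$, and

$$(\mathcal{T}_{\pi} Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\!\Big[\mathbb{E}_{a'\sim\pi}\big[Q(s',a') - \tau \log \pi(a'\mid s')\big]\Big],$$

with temperature $\tau > 0$ controlling the strength of the entropy regularization in the critic.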