In recent years, deep off-policy actor-critic algorithms have become a dominant approach to reinforcement learning for continuous control. This follows a series of breakthroughs addressing function approximation errors, which previously led to poor performance. These insights encourage the use of pessimistic value updates. However, pessimism discourages exploration and runs counter to theoretical support for the efficacy of optimism in the face of uncertainty. So which approach is best? In this work, we show that the optimal degree of optimism can vary both across tasks and over the course of learning. Inspired by this insight, we introduce a novel deep actor-critic algorithm, Dynamic Optimistic and Pessimistic Estimation (DOPE), which switches between optimistic and pessimistic value learning online by formulating the selection as a multi-armed bandit problem. We show on a series of challenging continuous control tasks that DOPE outperforms existing state-of-the-art methods, which rely on a fixed degree of optimism. Since our changes are simple to implement, we believe these insights can be extended to a number of off-policy algorithms.
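To make the bandit formulation concrete, below is a minimal sketch of how a two-armed bandit could select between optimistic and pessimistic value targets online. This is an illustrative assumption, not the paper's exact controller: the class name `OptimismBandit`, the UCB-style selection rule, and the use of measured performance change as feedback are all hypothetical choices made for this example.

```python
import numpy as np

class OptimismBandit:
    """Hypothetical UCB-style two-armed bandit that picks between
    optimistic and pessimistic value targets each learning phase.
    Illustrative sketch only; not the paper's exact controller."""

    def __init__(self, arms=("optimistic", "pessimistic"), c=1.0):
        self.arms = arms
        self.c = c                          # exploration coefficient
        self.counts = np.zeros(len(arms))   # times each arm was played
        self.values = np.zeros(len(arms))   # running mean of feedback

    def select(self):
        # Play each arm once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = self.counts.sum()
        ucb = self.values + self.c * np.sqrt(np.log(total) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, feedback):
        # Feedback could be, e.g., the change in episodic return observed
        # while training with the chosen degree of optimism.
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]


# Usage sketch: choose a target style, train for a phase, report feedback.
bandit = OptimismBandit()
arm = bandit.select()
# ... run actor-critic updates using bandit.arms[arm] value targets ...
improvement = 0.3  # placeholder: measured change in performance
bandit.update(arm, improvement)
```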