Model-based Reinforcement Learning (MBRL) algorithms have traditionally been designed with the goal of learning accurate dynamics of the environment. This introduces a mismatch between the objective of model learning and the overall learning problem of finding an optimal policy. Value-aware model learning, an alternative model-learning paradigm to maximum likelihood, proposes to inform model learning through the value function of the learnt policy. While this paradigm is theoretically sound, it does not scale beyond toy settings. In this work, we propose a novel value-aware objective that is an upper bound on the absolute performance difference of a policy across two models. Further, we propose a general-purpose algorithm that modifies the standard MBRL pipeline, enabling learning with value-aware objectives. Our proposed objective, in conjunction with this algorithm, is the first successful instantiation of value-aware MBRL on challenging continuous control environments, outperforming previous value-aware objectives and achieving competitive performance w.r.t. MLE-based MBRL approaches.
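To make the notion of a value-aware model-learning objective concrete, a representative formulation from the prior value-aware model learning literature (a sketch for illustration, not necessarily the bound proposed in this work) measures model error through the value function rather than through likelihood:

$$\mathcal{L}_{V}(\hat{P}) \;=\; \mathbb{E}_{(s,a)\sim\mu}\Big[\big(\mathbb{E}_{s'\sim P(\cdot\mid s,a)}[V(s')] \;-\; \mathbb{E}_{s'\sim \hat{P}(\cdot\mid s,a)}[V(s')]\big)^{2}\Big],$$

where $P$ denotes the true dynamics, $\hat{P}$ the learnt model, $\mu$ a state-action distribution, and $V$ the value function of the current policy. In contrast, an MLE-based objective fits $\hat{P}$ to $P$ irrespective of $V$, which is the source of the objective mismatch discussed above.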