While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample-efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods. While such sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.
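To make the idea of a single joint objective concrete, below is a minimal, hypothetical PyTorch sketch in which one loss trains an encoder, a latent dynamics model, a reward head, and a policy together: discounted predicted rewards are maximized while a self-consistency penalty keeps imagined latent rollouts close to encodings of the real next observations. All module names, network sizes, and the squared-error consistency penalty are illustrative assumptions for exposition, not the paper's exact bound.

```python
# Illustrative sketch (not the paper's exact objective): one loss jointly trains
# an encoder, latent dynamics model, reward head, and policy. The loss trades off
# predicted discounted return against a latent self-consistency penalty.
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim, horizon, gamma = 17, 6, 32, 3, 0.99  # assumed sizes

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
reward_head = nn.Sequential(nn.Linear(latent_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())

params = (list(encoder.parameters()) + list(dynamics.parameters())
          + list(reward_head.parameters()) + list(policy.parameters()))
optimizer = torch.optim.Adam(params, lr=3e-4)

def joint_loss(obs_seq):
    """obs_seq: (horizon + 1, batch, obs_dim) of real observations.

    Rolls the policy out in latent space, sums discounted predicted rewards,
    and adds a self-consistency penalty keeping the latent model's predictions
    close to the encoder's embedding of the next real observation. Minimizing
    this loss maximizes a surrogate for expected return with one objective.
    """
    z = encoder(obs_seq[0])
    ret, consistency = 0.0, 0.0
    for t in range(horizon):
        a = policy(z)
        ret = ret + (gamma ** t) * reward_head(torch.cat([z, a], dim=-1)).mean()
        z_pred = dynamics(torch.cat([z, a], dim=-1))
        z_target = encoder(obs_seq[t + 1])
        consistency = consistency + (z_pred - z_target).pow(2).mean()
        z = z_pred
    return -(ret - consistency)  # single objective for model, encoder, and policy

# Usage with a dummy batch of observation sequences:
obs_seq = torch.randn(horizon + 1, 128, obs_dim)
loss = joint_loss(obs_seq)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this sketch the same gradient step updates the representation, the model, and the policy, which is the sense in which the objective is "single"; a probabilistic treatment (e.g., a KL-based consistency term) would be needed to make the loss a formal lower bound on returns.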