Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting, since learning a model of the environment lets them extract more learning signal from the logged dataset. However, the performance of existing model-based approaches falls short of their model-free counterparts because estimation errors in the learned model compound. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively with respect to both. To this end, we derive a simple and elegant methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), which trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by $116.4$%, MOReL by $23.2$%, and COMBO by $23.7$%. Further, CBOP achieves state-of-the-art performance on $11$ out of $18$ benchmark datasets while performing on par on the remaining datasets.
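As a rough illustration of the trade-off described above, and not the authors' implementation, the sketch below assumes that each candidate value estimate (one model-free estimate and several model-based rollouts of different horizons) comes with a mean and a variance reflecting its epistemic uncertainty, for example from an ensemble. Under an assumed Gaussian model, the estimates are fused by precision weighting, so that uncertain estimates contribute less, and a conservative target is formed by subtracting a multiple of the posterior standard deviation. The function name `conservative_fused_target`, the coefficient `psi`, and the sample numbers are hypothetical.

```python
import numpy as np


def conservative_fused_target(means, variances, psi=1.0, eps=1e-8):
    """Fuse value estimates by precision weighting and return a lower bound.

    means, variances: per-estimate mean and (epistemic) variance; assumed inputs.
    psi: hypothetical conservatism coefficient (number of std devs to subtract).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Precision (inverse variance): low-uncertainty estimates get more weight.
    precisions = 1.0 / (variances + eps)

    # Gaussian posterior under a flat prior (an assumption of this sketch).
    post_var = 1.0 / precisions.sum()
    post_mean = post_var * (precisions * means).sum()

    # Conservative target: lower confidence bound on the posterior mean.
    lower_bound = post_mean - psi * np.sqrt(post_var)
    return post_mean, post_var, lower_bound


# Hypothetical numbers: a model-free estimate and two model-based rollouts,
# where the longest rollout is the least certain.
means = [10.0, 12.0, 20.0]
variances = [1.0, 4.0, 100.0]
post_mean, post_var, lcb = conservative_fused_target(means, variances, psi=2.0)
print(post_mean, post_var, lcb)
```

In this sketch the highly uncertain third estimate barely moves the fused mean, and increasing `psi` makes the target more pessimistic.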