We consider the problem of offline reinforcement learning with model-based control, whose goal is to learn a dynamics model from experience replay data and to obtain a pessimism-oriented agent under the learned model. Current model-based constraints fall into two categories: explicit uncertainty penalties and implicit conservative regularization that pushes the Q-values of out-of-distribution state-action pairs down and those of in-distribution pairs up. While the uncertainty estimation on which the former relies can be loosely calibrated for complex dynamics, the latter tends to perform slightly better. To extend the basic idea of regularization without uncertainty quantification, we propose distributionally robust offline model-based policy optimization (DROMO), which leverages ideas from distributionally robust optimization to penalize a broader range of out-of-distribution state-action pairs, beyond the standard empirical out-of-distribution Q-value minimization. We theoretically show that our method optimizes a lower bound on the ground-truth policy evaluation, and that it can be incorporated into any existing policy gradient algorithm. We also analyze the theoretical properties of DROMO's linear and non-linear instantiations.
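To make the regularization idea concrete, the following is a minimal, illustrative sketch rather than DROMO's exact objective: a conservative (CQL-style) penalty samples out-of-distribution actions from a single distribution $\mu$, whereas a distributionally robust variant takes the worst case over an ambiguity set around $\mu$. The symbols $\alpha$, $\mu$, $\nu$, $D$, and $\rho$ are introduced here purely for illustration and are not taken from the paper.
\[
\underbrace{\alpha\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(\cdot\mid s)}\big[Q(s,a)\big]-\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big)}_{\text{empirical OOD penalty}}
\;\longrightarrow\;
\underbrace{\alpha\Big(\sup_{\nu:\,D(\nu\,\|\,\mu)\le\rho}\mathbb{E}_{s\sim\mathcal{D},\,a\sim\nu(\cdot\mid s)}\big[Q(s,a)\big]-\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big)}_{\text{distributionally robust penalty}}
\]
In this sketch, either term would be added to the Bellman-error loss under the learned dynamics model; the ambiguity set $\{\nu : D(\nu\,\|\,\mu)\le\rho\}$ is what allows the penalty to cover a broader range of out-of-distribution state-action pairs than the single sampling distribution $\mu$.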