In offline reinforcement learning (RL), one issue detrimental to policy learning is the accumulation of errors of the deep Q function in out-of-distribution (OOD) regions. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside the data distribution. In our study, one interesting observation is that deep Q functions approximate well inside the convex hull of the training data. Inspired by this, we propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD regions rather than strictly constraining the policy within the data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm achieves better generalization than state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach over existing methods that rely solely on data-distribution or support constraints.
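To make the idea of a state-conditioned distance constraint concrete, below is a minimal, hedged sketch of how such a learned distance function might be plugged into an actor-critic update. The abstract only states that a distance function is trained and used as a policy constraint; the network sizes, the regression target for the distance (Euclidean distance between a sampled action and the dataset action at the same state), and the penalty weight `LAMBDA` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and hyperparameters; not taken from the paper.
STATE_DIM, ACTION_DIM, HIDDEN = 17, 6, 256
LAMBDA = 1.0  # weight of the distance penalty in the actor loss

def mlp(in_dim, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, out_dim),
    )

# State-conditioned distance function g(s, a), critic Q(s, a), and policy pi(s).
dist_fn = mlp(STATE_DIM + ACTION_DIM, 1)
critic = mlp(STATE_DIM + ACTION_DIM, 1)
actor = nn.Sequential(mlp(STATE_DIM, ACTION_DIM), nn.Tanh())

dist_opt = torch.optim.Adam(dist_fn.parameters(), lr=3e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def train_distance_step(states, dataset_actions):
    """Regress g(s, a) toward the Euclidean distance from a randomly sampled
    action to the dataset action observed at the same state (a simplifying
    proxy for the distance to the dataset's action support)."""
    sampled = torch.rand_like(dataset_actions) * 2 - 1  # uniform in [-1, 1]
    target = (sampled - dataset_actions).norm(dim=-1, keepdim=True)
    pred = dist_fn(torch.cat([states, sampled], dim=-1))
    loss = ((pred - target) ** 2).mean()
    dist_opt.zero_grad(); loss.backward(); dist_opt.step()

def train_actor_step(states):
    """Actor update: maximize Q while penalizing the learned distance, i.e.
    keep the policy close to the dataset geometry rather than to the
    behavior policy's density (the critic is assumed to be trained elsewhere)."""
    actions = actor(states)
    q = critic(torch.cat([states, actions], dim=-1))
    d = dist_fn(torch.cat([states, actions], dim=-1))
    loss = (-q + LAMBDA * d).mean()
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()
```

In this sketch the distance penalty replaces the density or divergence term used by distribution-constrained methods, which is the design choice the abstract contrasts against: actions far from the dataset geometry are discouraged, while OOD actions that remain close to the training data (e.g., inside its convex hull) are not heavily penalized.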