Recent progress in deep learning has relied on access to large and diverse datasets. Such data-driven progress has been less evident in offline reinforcement learning (RL), because offline RL data is usually collected to optimize specific target tasks, which limits the data's diversity. In this work, we propose Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL. ExORL first generates data with unsupervised reward-free exploration, then relabels this data with a downstream reward before training a policy with offline RL. We find that exploratory data allows vanilla off-policy RL algorithms, without any offline-specific modifications, to outperform or match state-of-the-art offline RL algorithms on downstream tasks. Our findings suggest that data generation is as important as algorithmic advances for offline RL and hence requires careful consideration from the community.
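A minimal sketch of the three-step pipeline described above, assuming a gym-style environment and a generic off-policy agent; `explore_policy`, `reward_fn`, and `agent.update` are illustrative placeholders, not the paper's actual implementation.

```python
import numpy as np


def collect_exploratory_data(env, explore_policy, num_steps):
    """Step 1: collect reward-free transitions with an unsupervised exploration policy."""
    buffer = []
    obs, _ = env.reset()
    for _ in range(num_steps):
        action = explore_policy(obs)
        next_obs, _, terminated, truncated, _ = env.step(action)
        buffer.append((obs, action, next_obs))  # note: no reward is stored
        obs = env.reset()[0] if (terminated or truncated) else next_obs
    return buffer


def relabel_with_task_reward(buffer, reward_fn):
    """Step 2: relabel every stored transition with the downstream task reward."""
    return [(s, a, reward_fn(s, a, s_next), s_next) for (s, a, s_next) in buffer]


def train_offline(agent, dataset, num_updates, batch_size=256):
    """Step 3: train a vanilla off-policy agent purely from the fixed, relabeled dataset."""
    for _ in range(num_updates):
        idx = np.random.randint(len(dataset), size=batch_size)
        batch = [dataset[i] for i in idx]
        agent.update(batch)  # e.g. a standard TD3- or SAC-style update, unmodified for the offline setting
    return agent
```

The key point of the sketch is that the offline training step uses an off-the-shelf off-policy update; the diversity of the exploratory dataset, not an offline-specific algorithm, is what carries the downstream performance.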