Sample efficiency and exploration remain major challenges in online reinforcement learning (RL). A powerful approach for addressing these issues is to incorporate offline data, such as prior trajectories from a human expert or a sub-optimal exploration policy. Previous methods have relied on extensive modifications and additional complexity to ensure the effective use of such data. Instead, we ask: can we simply apply existing off-policy methods to leverage offline data when learning online? In this work, we demonstrate that the answer is yes; however, a set of minimal but important changes to existing off-policy RL algorithms is required to achieve reliable performance. We extensively ablate these design choices, identify the key factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We find that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead.
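The following is a minimal sketch of the general setup described above: an off-policy agent trained online whose update batches also draw on a fixed offline dataset. The `MixedReplayBuffer` class, the default 50/50 offline/online sampling fraction, and the transition format are illustrative assumptions for exposition, not the specific recommendations studied in this work.

```python
# Sketch (assumption, not the paper's prescribed design): mix offline transitions
# into the batches used by a standard off-policy learner during online training.
import random
from collections import deque

class MixedReplayBuffer:
    """Replay buffer that samples from both offline data and online experience."""

    def __init__(self, offline_transitions, capacity=1_000_000):
        self.offline = list(offline_transitions)   # fixed prior data (e.g. demonstrations)
        self.online = deque(maxlen=capacity)       # transitions collected while learning online

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size, offline_fraction=0.5):
        # Draw part of every batch from the offline data and the rest from online
        # experience; the exact fraction here is a tunable, assumed default.
        n_off = int(batch_size * offline_fraction)
        n_on = batch_size - n_off
        batch = random.sample(self.offline, min(n_off, len(self.offline)))
        batch += random.sample(list(self.online), min(n_on, len(self.online)))
        return batch

if __name__ == "__main__":
    # Toy demonstration with dummy (state, action, reward, next_state) tuples.
    offline_data = [((0.0,), 0, 1.0, (1.0,)) for _ in range(100)]
    buf = MixedReplayBuffer(offline_data)
    for t in range(200):
        buf.add(((float(t),), 1, 0.0, (float(t + 1),)))
    batch = buf.sample(batch_size=8)
    print(len(batch), "transitions sampled (half offline, half online)")
```

In this sketch the sampled batch would be passed to the update step of any existing off-policy algorithm; the design choices that make such a setup reliable are what the ablations in this work examine.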