Efficient exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as correlations between actions and directedness. To this end, we introduce state-independent temporal priors, which directly model temporal consistency in demonstrated trajectories and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning by dynamically sampling actions from a probabilistic mixture of policy and action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse reward settings.
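As a rough illustration of the mixture-based integration scheme mentioned above, the sketch below samples each action either from the off-policy RL actor or from a state-independent temporal prior that conditions only on recent actions. The `policy.sample(state)` and `prior.sample(recent_actions)` interfaces and the fixed `mix_prob` constant are assumptions made for this example; the actual method adapts the mixing probability dynamically during training, which is not reproduced here.

```python
import numpy as np


def sample_action(policy, prior, state, recent_actions, mix_prob=0.5, rng=None):
    """Sample an action from a probabilistic mixture of the RL policy and a
    state-independent temporal action prior (hypothetical interfaces).

    With probability ``mix_prob`` the action is drawn from the prior, which
    conditions only on the recent action history (not on the state);
    otherwise it is drawn from the learned policy. ``mix_prob`` is kept
    constant here purely for illustration.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < mix_prob:
        # Temporal prior: models p(a_t | a_{t-k:t-1}), encouraging
        # temporally consistent, directed exploration.
        return prior.sample(recent_actions)
    # Off-policy actor: models pi(a_t | s_t).
    return policy.sample(state)
```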