One of the key challenges in Reinforcement Learning (RL) is the ability of agents to generalise their learned policy to unseen settings. Moreover, training RL agents requires large numbers of interactions with the environment. Motivated by the recent success of Offline RL and Imitation Learning (IL), we conduct a study to investigate whether agents can leverage offline data in the form of trajectories to improve sample efficiency in procedurally generated environments. We consider two settings for using IL from offline data alongside RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyse the impact of the quality (optimality of trajectories) and diversity (number of trajectories and covered levels) of the available offline trajectories on the effectiveness of both approaches. Across four well-known sparse-reward tasks in the MiniGrid environment, we find that using IL both for pre-training and concurrently during online RL training consistently improves sample efficiency while converging to optimal policies. Furthermore, we show that pre-training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.
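To make the two settings concrete, below is a minimal PyTorch-style sketch of (1) behavioural-cloning pre-training on offline trajectories and (2) a concurrent update that mixes an IL loss into the online RL loss. This is an illustrative sketch, not the paper's implementation: the policy architecture, the function names `bc_pretrain` and `concurrent_update`, and the mixing coefficient `il_coef` are all assumptions.

```python
# Sketch of the two IL settings: (1) pre-training, (2) concurrent IL + online RL.
# Assumes a flat observation vector and discrete actions (as in MiniGrid);
# all names and hyperparameters here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """Small discrete-action policy network returning action logits."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def bc_pretrain(policy, optimiser, offline_obs, offline_actions, epochs=10):
    """Setting (1): behavioural cloning on offline (state, action) pairs
    before any online RL training."""
    for _ in range(epochs):
        loss = F.cross_entropy(policy(offline_obs), offline_actions)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()


def concurrent_update(policy, optimiser, rl_loss, offline_obs, offline_actions,
                      il_coef=0.5):
    """Setting (2): one gradient step combining the online RL loss
    (e.g. a PPO loss computed from the same policy, passed in as a
    differentiable tensor) with an IL loss on the offline trajectories."""
    il_loss = F.cross_entropy(policy(offline_obs), offline_actions)
    total_loss = rl_loss + il_coef * il_loss
    optimiser.zero_grad()
    total_loss.backward()
    optimiser.step()
```

In this sketch the offline trajectories enter only through a cross-entropy term on the demonstrated actions; how that term is weighted against the RL objective, and how many trajectories it covers, corresponds to the quality and diversity factors studied above.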