Skill-based Meta-Reinforcement Learning

While deep reinforcement learning methods have shown impressive results in robot learning, their sample inefficiency makes the learning of complex, long-horizon behaviors with real robot systems infeasible. To mitigate this issue, meta-reinforcement learning methods aim to enable fast learning on novel tasks by learning how to learn. Yet, the application has been limited to short-horizon tasks with dense rewards. To enable learning long-horizon behaviors, recent works have explored leveraging prior experience in the form of offline datasets without reward or task annotations. While these approaches yield improved sample efficiency, millions of interactions with environments are still required to solve complex tasks. In this work, we devise a method that enables meta-learning on long-horizon, sparse-reward tasks, allowing us to solve unseen target tasks with orders of magnitude fewer environment interactions. Our core idea is to leverage prior experience extracted from offline datasets during meta-learning. Specifically, we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose learned skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task. Experimental results on continuous control tasks in navigation and manipulation demonstrate that the proposed method can efficiently solve long-horizon novel target tasks by combining the strengths of meta-learning and the usage of offline datasets, while prior approaches in RL, meta-RL, and multi-task RL require substantially more environment interactions to solve the tasks.
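
The role of skills in steps (1) and (2) can be illustrated with a minimal sketch. It is a toy, tabular stand-in and not the paper's learned models: the function names `extract_skills` and `rollout`, the fixed `SKILL_LENGTH`, the 2D grid moves, and the count-based prior are all assumptions made for illustration. It shows only the temporal abstraction: a high-level policy picks one skill per decision, and each skill expands into several low-level actions, so a long-horizon behavior needs few high-level decisions. Meta-training and adaptation are not implemented here.

```python
# Hypothetical toy setup: a "skill" is a fixed-length sequence of 2D moves.
SKILL_LENGTH = 3


def extract_skills(offline_trajectories, skill_length=SKILL_LENGTH):
    """Toy version of step (1): cut reward-free action sequences into
    fixed-length chunks. Normalised chunk counts act as a crude skill prior."""
    counts = {}
    for actions in offline_trajectories:
        for i in range(0, len(actions) - skill_length + 1, skill_length):
            chunk = tuple(actions[i:i + skill_length])
            counts[chunk] = counts.get(chunk, 0) + 1
    total = sum(counts.values())
    skills = sorted(counts)  # deterministic ordering of the skill set
    prior = {skill: counts[skill] / total for skill in skills}
    return skills, prior


def rollout(high_level_policy, skills, start, num_decisions):
    """Toy view of step (2): the high-level policy returns a skill index
    once per decision; the chosen skill then runs for skill_length steps."""
    x, y = start
    for _ in range(num_decisions):
        skill = skills[high_level_policy((x, y))]
        for dx, dy in skill:
            x, y = x + dx, y + dy
    return (x, y)


if __name__ == "__main__":
    # Offline data without rewards or task labels (made up for this sketch).
    right, up = (1, 0), (0, 1)
    trajectories = [[right] * 3 + [up] * 3, [right] * 3]
    skills, prior = extract_skills(trajectories)

    # Hand-written high-level policy: move right until x reaches 6, then up.
    # Index 1 is the "right" skill and index 0 the "up" skill after sorting.
    policy = lambda state: 1 if state[0] < 6 else 0

    # 4 high-level decisions cover 12 low-level steps.
    print(rollout(policy, skills, start=(0, 0), num_decisions=4))  # (6, 6)
```

In this sketch the high-level policy is hand-written. In the method described above it would be meta-trained across tasks and then adapted to the unseen target task, with the skill prior guiding which skills to try.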
