State-of-the-art reinforcement learning has enabled training agents on tasks of ever-increasing complexity. However, the current paradigm tends to favor training agents from scratch on every new task or on collections of tasks with a view towards generalizing to novel task configurations. The former suffers from poor data efficiency, while the latter is difficult when test tasks are out of distribution. Agents that can effectively transfer their knowledge about the world pose a potential solution to these issues. In this paper, we investigate transfer learning in the context of model-based agents. Specifically, we aim to understand when exactly environment models have an advantage, and why. We find that a model-based approach outperforms controlled model-free baselines for transfer learning. Through ablations, we show that both the policy and the dynamics model learnt through exploration matter for successful transfer. We demonstrate our results across three domains that vary in their requirements for transfer: in-distribution procedural (Crafter), in-distribution identical (RoboDesk), and out-of-distribution (Meta-World). Our results show that intrinsic exploration combined with environment models presents a viable direction towards agents that are self-supervised and able to generalize to novel reward functions.