Designing agents that acquire knowledge autonomously and use it to solve new tasks efficiently is an important challenge in reinforcement learning. Knowledge acquired during an unsupervised pre-training phase is often transferred by fine-tuning neural network weights once rewards are exposed, as is common practice in supervised domains. Given the nature of the reinforcement learning problem, we argue that standard fine-tuning strategies alone are not enough for efficient transfer in challenging domains. We introduce Behavior Transfer (BT), a technique that leverages pre-trained policies for exploration and that is complementary to transferring neural network weights. Our experiments show that, when combined with large-scale pre-training in the absence of rewards, existing intrinsic motivation objectives can lead to the emergence of complex behaviors. These pre-trained policies can then be leveraged by BT to discover better solutions than without pre-training, and combining BT with standard fine-tuning strategies results in additional benefits. The largest gains are generally observed in domains requiring structured exploration, including settings where the behavior of the pre-trained policies is misaligned with the downstream task.
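To make the idea of using pre-trained policies for exploration concrete, here is a minimal sketch of one way such a scheme could be instantiated. This is an illustration only, not the paper's exact algorithm: the names (`env`, `task_policy`, `pretrained_policy`, `mix_prob`) and the simple per-step mixing rule are assumptions for the example.

```python
import numpy as np

def collect_episode(env, task_policy, pretrained_policy,
                    mix_prob=0.5, max_steps=1000):
    """Roll out one episode, occasionally sampling actions from a
    pre-trained behavior policy to drive exploration while the task
    policy is being learned. Hypothetical sketch, not the paper's BT."""
    transitions = []
    obs = env.reset()
    for _ in range(max_steps):
        # With probability mix_prob, follow the pre-trained behavior;
        # otherwise act with the current task policy.
        if np.random.rand() < mix_prob:
            action = pretrained_policy(obs)
        else:
            action = task_policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions
```

The collected transitions would then be used to update the task policy with any standard off-policy or on-policy RL algorithm; the point of the sketch is only that exploration data can come from the pre-trained behavior while the downstream policy is trained on task reward.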