We study ObjectGoal Navigation (ObjectNav), where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) using behavior cloning (BC) on a dataset of human demonstrations achieves promising results. However, this has limitations: 1) BC policies generalize poorly to new states, since the training mimics actions rather than their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present PIRLNav, a two-stage learning scheme of BC pretraining on human demonstrations followed by RL-finetuning. This leads to a policy that achieves a success rate of $65.0\%$ on ObjectNav ($+5.0\%$ absolute over the previous state-of-the-art). Using this BC$\rightarrow$RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with `free' (automatically generated) sources of demonstrations, e.g., shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that BC$\rightarrow$RL on human demonstrations outperforms BC$\rightarrow$RL on SP and FE trajectories, even when controlled for the same BC-pretraining success on train, and even on a subset of val episodes where BC-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the BC-pretraining dataset. We find that as we increase the size of the BC-pretraining dataset and reach high BC accuracies, the improvements from RL-finetuning become smaller, and that $90\%$ of the performance of our best BC$\rightarrow$RL policy can be achieved with less than half the number of BC demonstrations. Finally, we analyze failure modes of our ObjectNav policies and present guidelines for further improving them.
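To make the two-stage BC$\rightarrow$RL recipe concrete, the sketch below shows BC pretraining (cross-entropy on demonstrated actions) followed by RL-finetuning of the same network with a policy gradient. This is a minimal illustration, not the PIRLNav implementation: the actual system trains a recurrent visual policy in simulation with PPO-style updates, whereas PolicyNet, ToyEnv, the REINFORCE update, and all hyperparameters here are hypothetical stand-ins used only to show the structure of the recipe.

\begin{verbatim}
# Minimal sketch of the BC -> RL recipe (NOT the paper's implementation).
# PolicyNet, ToyEnv, and all hyperparameters are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Tiny categorical policy over discrete navigation actions."""
    def __init__(self, obs_dim=16, n_actions=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)  # action logits

class ToyEnv:
    """Stand-in environment: random observations, fixed-length episodes."""
    def __init__(self, obs_dim=16, horizon=20):
        self.obs_dim, self.horizon, self.t = obs_dim, horizon, 0

    def reset(self):
        self.t = 0
        return torch.randn(self.obs_dim)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0  # arbitrary toy reward
        return torch.randn(self.obs_dim), reward, self.t >= self.horizon

def bc_pretrain(policy, demos, epochs=10, lr=1e-3):
    """Stage 1: behavior cloning, i.e. cross-entropy on demonstrated actions."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act in demos:  # batches of (observations, expert actions)
            loss = F.cross_entropy(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()

def rl_finetune(policy, env, episodes=50, lr=1e-4, gamma=0.99):
    """Stage 2: finetune the SAME weights with REINFORCE (the paper uses
    PPO-style updates; REINFORCE keeps the sketch short)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        obs, done = env.reset(), False
        logps, rewards = [], []
        while not done:
            dist = torch.distributions.Categorical(logits=policy(obs))
            act = dist.sample()
            logps.append(dist.log_prob(act))
            obs, r, done = env.step(act.item())
            rewards.append(r)
        # discounted returns, then vanilla policy-gradient loss
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        loss = -(torch.stack(logps) * torch.tensor(returns)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    policy = PolicyNet()
    demos = [(torch.randn(32, 16), torch.randint(0, 6, (32,)))
             for _ in range(8)]            # fake "demonstrations"
    bc_pretrain(policy, demos)             # stage 1: BC on demonstrations
    rl_finetune(policy, ToyEnv())          # stage 2: RL-finetuning, same weights
\end{verbatim}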