PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav

We study ObjectGoal Navigation (ObjectNav), where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) on a dataset of human demonstrations achieves promising results. However, this approach has two limitations: (1) IL policies generalize poorly to new states, since the training mimics actions, not their consequences, and (2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present a two-stage learning scheme: IL pretraining on human demonstrations, followed by RL finetuning. This leads to a PIRLNav policy that advances the state of the art on ObjectNav from $60.0\%$ success rate to $65.0\%$ ($+5.0\%$ absolute). Using this IL$\rightarrow$RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with "free" (automatically generated) sources of demonstrations, e.g., shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that IL$\rightarrow$RL on human demonstrations outperforms IL$\rightarrow$RL on SP and FE trajectories, even when controlled for the same IL-pretraining success on TRAIN, and even on a subset of VAL episodes where IL-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the IL-pretraining dataset. We find that as the IL-pretraining dataset grows and IL accuracy becomes high, the improvements from RL finetuning become smaller, and that $90\%$ of the performance of our best IL$\rightarrow$RL policy can be achieved with less than half the number of IL demonstrations. Finally, we analyze failure modes of our ObjectNav policies and present guidelines for further improving them.
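The two-stage IL$\rightarrow$RL recipe described above can be illustrated with a toy sketch. This is not the paper's implementation. The corridor environment, the tabular softmax policy, and every name below (`bc_pretrain`, `rl_finetune`, `rollout`, `success_rate`) are hypothetical, and plain REINFORCE with a sparse success reward stands in for whatever RL algorithm and reward the authors actually use, which the abstract does not specify. The sketch only shows the structure: behavior cloning on demonstrations first, then RL updates that start from the cloned weights.

```python
import math
import random

N_STATES = 6         # toy corridor with cells 0..5; the goal is the last cell (assumption)
ACTIONS = (-1, +1)   # action 0 = step left, action 1 = step right
MAX_STEPS = 20       # episode length limit


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bc_pretrain(logits, demos, lr=0.5, epochs=50):
    """Stage 1: behavior cloning, i.e. cross-entropy on (state, action) pairs."""
    for _ in range(epochs):
        for state, action in demos:
            probs = softmax(logits[state])
            for a in range(len(ACTIONS)):
                # Gradient of log pi(action | state) with respect to logit a.
                grad = (1.0 if a == action else 0.0) - probs[a]
                logits[state][a] += lr * grad
    return logits


def rollout(logits, rng):
    """Sample one episode from cell 0; return the trajectory and a 0/1 success reward."""
    state, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(logits[state])
        action = 0 if rng.random() < probs[0] else 1
        trajectory.append((state, action))
        state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
        if state == N_STATES - 1:
            return trajectory, 1.0
    return trajectory, 0.0


def rl_finetune(logits, rng, lr=0.1, episodes=500):
    """Stage 2: REINFORCE (no baseline) on the sparse reward, starting from BC weights."""
    for _ in range(episodes):
        trajectory, reward = rollout(logits, rng)
        for state, action in trajectory:
            probs = softmax(logits[state])
            for a in range(len(ACTIONS)):
                grad = (1.0 if a == action else 0.0) - probs[a]
                logits[state][a] += lr * reward * grad
    return logits


def success_rate(logits, rng, episodes=200):
    """Fraction of sampled episodes that reach the goal."""
    return sum(rollout(logits, rng)[1] for _ in range(episodes)) / episodes
```

To use it, initialize `logits = [[0.0, 0.0] for _ in range(N_STATES)]`, call `bc_pretrain` with demonstrations that cover only part of the corridor, for example `[(0, 1), (1, 1), (2, 1)]`, and then call `rl_finetune` with a seeded `random.Random`. Cells that never appear in the demonstrations keep a uniform policy after stage 1, which is a small-scale analogue of the poor generalization to new states noted in the abstract. Stage 2 can then adjust those cells because it learns from the outcomes of the agent's own rollouts.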
