We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long sequence representations for embodied navigation. Our approach unifies world modeling, localization, and imitation learning into a single sequence prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture enables sharing of the spatio-temporal sequence encoder across multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines, while also performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.
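To make the training objective concrete, below is a minimal sketch of what vector-quantized next-state prediction can look like in PyTorch. All names, dimensions, and architectural choices here (`NextStatePredictor`, codebook size, a small Transformer encoder, one action token per step) are illustrative assumptions, not the actual ENTL implementation: each frame is assumed to already be tokenized into discrete VQ codes, and the model is trained with cross-entropy to predict the tokens of the next frame from the interleaved state/action sequence.

```python
# Sketch only: hypothetical shapes and module names, not the ENTL architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1024      # assumed size of the VQ codebook for frame tokens
TOKENS_PER_FRAME = 16  # assumed number of VQ tokens per observation
NUM_ACTIONS = 6        # assumed discrete navigation action space
EMBED_DIM = 256

class NextStatePredictor(nn.Module):
    """Unified sequence objective (sketch): given interleaved
    (frame-tokens, action) steps, predict the VQ tokens of the next frame."""

    def __init__(self):
        super().__init__()
        self.state_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.action_embed = nn.Embedding(NUM_ACTIONS, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One head emitting logits for every VQ token of the next frame.
        self.head = nn.Linear(EMBED_DIM, TOKENS_PER_FRAME * VOCAB_SIZE)

    def forward(self, state_tokens, actions):
        # state_tokens: (B, T, K) integer VQ codes per frame
        # actions:      (B, T)    discrete action taken after each frame
        s = self.state_embed(state_tokens)             # (B, T, K, D)
        a = self.action_embed(actions).unsqueeze(2)    # (B, T, 1, D)
        seq = torch.cat([s, a], dim=2)                 # (B, T, K+1, D)
        B, T, K1, D = seq.shape
        mask = nn.Transformer.generate_square_subsequent_mask(T * K1)
        h = self.encoder(seq.reshape(B, T * K1, D), mask=mask)
        summary = h.reshape(B, T, K1, D)[:, :, -1, :]  # action-position summary
        logits = self.head(summary)                    # (B, T, K*V)
        return logits.reshape(B, T, TOKENS_PER_FRAME, VOCAB_SIZE)

def vq_prediction_loss(model, state_tokens, actions):
    """Cross-entropy on the VQ tokens of frame t+1, conditioned on steps <= t."""
    logits = model(state_tokens[:, :-1], actions[:, :-1])  # predict frames 1..T-1
    targets = state_tokens[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))

# Usage example on random data:
model = NextStatePredictor()
frames = torch.randint(0, VOCAB_SIZE, (2, 8, TOKENS_PER_FRAME))
acts = torch.randint(0, NUM_ACTIONS, (2, 8))
loss = vq_prediction_loss(model, frames, acts)
loss.backward()
```

Because the objective is plain token prediction over the trajectory sequence, the same encoder can be reused for localization or imitation by changing which positions are supervised, which is the kind of task sharing the abstract describes.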