We propose the Embodied Navigation Trajectory Learner (ENTL), a method for extracting long-sequence representations for embodied navigation. Our approach unifies world modeling, localization and imitation learning into a single sequence-prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture enables sharing of the spatio-temporal sequence encoder across multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines, while also performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.
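To make the training objective concrete, here is a minimal sketch of the vector-quantized sequence-prediction idea: frame embeddings are mapped to discrete codebook indices, and the model's target at each timestep is the token of the *next* frame, conditioned on the current state and action. All names (`quantize`, `build_sequence`) and the toy data are hypothetical illustrations, not ENTL's actual implementation, which uses a learned codebook and a transformer encoder.

```python
def quantize(frame, codebook):
    """Return the index of the nearest codebook vector (squared L2 distance).
    In practice the codebook is learned (e.g., VQ-VAE style); here it is fixed."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))

def build_sequence(frames, actions, codebook):
    """Interleave quantized state tokens with action tokens. The prediction
    target at each step is the next state's token (future-frame prediction)."""
    state_tokens = [quantize(f, codebook) for f in frames]
    inputs, targets = [], []
    for t in range(len(frames) - 1):
        inputs.append(("state", state_tokens[t]))
        inputs.append(("action", actions[t]))
        targets.append(state_tokens[t + 1])  # predict the future frame's token
    return inputs, targets

# Toy usage: 2-D "frame embeddings", a 3-entry codebook, discrete actions.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.1), (0.9, 0.2), (0.1, 0.8)]
actions = ["forward", "turn_left"]
inputs, targets = build_sequence(frames, actions, codebook)
print(targets)  # → [1, 2]: token indices of the future frames
```

Casting future-frame prediction as discrete token prediction is what lets world modeling, localization and imitation all share one sequence-prediction loss, and it avoids requiring any reward signal during pre-training.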