Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical to solving compositional tasks, and that pretraining and joint training with synthetic instructions further improve performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on the seen and unseen test splits, respectively.
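To make the idea of encoding the full episode history concrete, here is a minimal, hypothetical PyTorch sketch of an E.T.-style encoder. The class name, dimensions, and fusion scheme (summing action embeddings into projected frame features, then jointly encoding them with the language tokens) are illustrative assumptions for exposition, not the paper's exact architecture; causal masking and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class EpisodicTransformerSketch(nn.Module):
    """Hypothetical sketch: language tokens, per-frame visual features,
    and past actions are embedded into one sequence and encoded jointly
    by a single transformer, so the agent conditions on the whole episode."""
    def __init__(self, vocab_size=100, num_actions=12, d_model=64,
                 nhead=4, num_layers=2, visual_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.action_emb = nn.Embedding(num_actions, d_model)
        # visual_dim stands in for precomputed frame features (e.g. a CNN backbone)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, lang_tokens, frames, past_actions):
        # lang_tokens: (B, L) token ids; frames: (B, T, visual_dim);
        # past_actions: (B, T) action ids for the episode so far
        seq = torch.cat([
            self.word_emb(lang_tokens),                              # instruction
            self.visual_proj(frames) + self.action_emb(past_actions) # episode history
        ], dim=1)                                                    # (B, L+T, d_model)
        h = self.encoder(seq)
        # predict the next action from the most recent timestep's encoding
        return self.action_head(h[:, -1])                            # (B, num_actions)

model = EpisodicTransformerSketch()
logits = model(torch.randint(0, 100, (2, 8)),   # 8 instruction tokens
               torch.randn(2, 5, 512),          # 5 observed frames
               torch.randint(0, 12, (2, 5)))    # 5 past actions
print(logits.shape)
```

The key point the sketch illustrates is that the transformer attends over the entire observation-action history rather than a single recurrent state, which the abstract identifies as critical for long sequences of compositional subtasks.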