Performing simple household tasks based on language directives is very natural to humans, yet it remains an open challenge for an AI agent. Recently, an 'interactive instruction following' task has been proposed to foster research in reasoning over long instruction sequences that require object interactions in a simulated environment. Solving it involves addressing open problems in the vision, language, and navigation literature at each step. To tackle this multifaceted problem, we propose a modular architecture that decouples the task into visual perception and action policy, and name it MOCA, a Modular Object-Centric Approach. We evaluate our method on the ALFRED benchmark and empirically validate that it outperforms prior art by significant margins on all metrics, with strong generalization as indicated by high success rates in unseen environments. Our code is available at https://github.com/gistvision/moca.
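To make the decoupling concrete, the sketch below shows one way a two-stream agent could be organized: one module grounds the visual observation into an object-centric prediction ("what to interact with"), while the other maps the language instruction to a low-level action ("how to act"). This is a minimal illustration under assumed interfaces, not MOCA's actual implementation; all module names, layer sizes, and class counts are placeholders.

```python
# Illustrative two-stream sketch (assumed interfaces, not the MOCA codebase).
import torch
import torch.nn as nn

class PerceptionStream(nn.Module):
    """Visual perception: predicts the target-object class from the frame."""
    def __init__(self, num_object_classes=80, feat_dim=512):
        super().__init__()
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.object_head = nn.Linear(feat_dim, num_object_classes)

    def forward(self, frame):
        feat = self.visual_encoder(frame)
        return self.object_head(feat)           # logits over object classes

class PolicyStream(nn.Module):
    """Action policy: predicts the next low-level action from the instruction."""
    def __init__(self, vocab_size=1000, num_actions=12, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lang_encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, instruction_tokens):
        emb = self.embed(instruction_tokens)
        _, (h, _) = self.lang_encoder(emb)
        return self.action_head(h[-1])          # logits over discrete actions

# Usage sketch: the two streams are queried independently each step, and their
# outputs (discrete action + target object) are combined to act in the environment.
frame = torch.randn(1, 3, 224, 224)
tokens = torch.randint(0, 1000, (1, 20))
action_logits = PolicyStream()(tokens)
object_logits = PerceptionStream()(frame)
```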