Given a simple request like "Put a washed apple in the kitchen fridge", humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Côté et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, and visual scene understanding).