Despite recent progress, learning new tasks through language instructions remains an extremely challenging problem. On the ALFRED benchmark for task learning, the published state-of-the-art system achieves a task success rate of less than 10% in unseen environments, compared to human performance of over 90%. To address this gap, this paper takes a closer look at task learning. Departing from the widely used end-to-end architecture, we decompose task learning into three sub-problems: sub-goal planning, scene navigation, and object manipulation. We develop a model, HiTUT (Hierarchical Tasks via Unified Transformers), that addresses each sub-problem in a unified manner to learn a hierarchical task structure. On the ALFRED benchmark, HiTUT achieves the best performance, with markedly better generalization. In unseen environments, HiTUT improves the success rate by over 160% relative to the previous state of the art. The explicit representation of task structure also enables an in-depth analysis of the nature of the problem and the abilities of the agent, which provides insight for future benchmark development and evaluation.
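The three-way decomposition described in the abstract (sub-goal planning, scene navigation, object manipulation) can be illustrated with a small sketch. This is not HiTUT's implementation: the learned transformer components are replaced by hard-coded stand-ins, and every name here (`SubGoal`, `plan_sub_goals`, `navigate`, `manipulate`, `run_task`, and the action strings such as `goto:mug`) is hypothetical and chosen only for illustration. The point is the control flow, in which a planner emits typed sub-goals and each sub-goal is dispatched to the module for its sub-problem.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SubGoal:
    kind: str    # which sub-problem handles it: "navigate" or "manipulate"
    action: str  # illustrative action label, e.g. "goto", "pickup", "put"
    target: str  # object or receptacle the action applies to


def plan_sub_goals(instruction: str) -> List[SubGoal]:
    """Toy stand-in for a learned sub-goal planner: a hard-coded lookup."""
    toy_plans: Dict[str, List[SubGoal]] = {
        "put the mug in the sink": [
            SubGoal("navigate", "goto", "mug"),
            SubGoal("manipulate", "pickup", "mug"),
            SubGoal("navigate", "goto", "sink"),
            SubGoal("manipulate", "put", "sink"),
        ]
    }
    # Unknown instructions yield an empty plan in this sketch.
    return toy_plans.get(instruction.lower().strip(), [])


def navigate(goal: SubGoal) -> List[str]:
    """Toy stand-in for scene navigation: one symbolic step per sub-goal."""
    return [f"goto:{goal.target}"]


def manipulate(goal: SubGoal) -> List[str]:
    """Toy stand-in for object manipulation: one symbolic interaction."""
    return [f"{goal.action}:{goal.target}"]


# Dispatch table: one handler per sub-problem type.
HANDLERS: Dict[str, Callable[[SubGoal], List[str]]] = {
    "navigate": navigate,
    "manipulate": manipulate,
}


def run_task(instruction: str) -> List[Tuple[SubGoal, List[str]]]:
    """Plan sub-goals, then expand each one into low-level actions.

    The returned trace keeps the hierarchy explicit: every low-level
    action stays attached to the sub-goal that produced it.
    """
    trace: List[Tuple[SubGoal, List[str]]] = []
    for goal in plan_sub_goals(instruction):
        trace.append((goal, HANDLERS[goal.kind](goal)))
    return trace


if __name__ == "__main__":
    for goal, actions in run_task("Put the mug in the sink"):
        print(goal.kind, actions)
```

Because the trace records which sub-goal each action belongs to, a failure can be attributed to planning, navigation, or manipulation separately, which is the kind of analysis the explicit task structure is said to enable.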