Building a conversational embodied agent to execute real-life tasks has been a long-standing yet challenging research goal, as it requires effective human-agent communication, multi-modal understanding, and long-range sequential decision making. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To get the best of both worlds, we propose a Neuro-Symbolic Commonsense Reasoning (JARVIS) framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then, the symbolic module reasons about sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks: Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC). For example, our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%. Moreover, we systematically analyze the essential factors that affect task performance and demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.
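The two-stage flow described above (symbolic sub-goals from language, then symbolic action generation over a map) can be illustrated with a toy sketch. This is not the JARVIS implementation: the function names, the `PickUp`/`Place` sub-goal vocabulary, the keyword lookup standing in for LLM prompting, and the grid-based semantic map are all assumptions made for illustration.

```python
from collections import deque


def plan_subgoals(dialog):
    """Hypothetical stand-in for LLM prompting: dialog text -> symbolic sub-goals."""
    # A real system would prompt an LLM here; this keyword lookup is a placeholder.
    if "coffee" in dialog.lower():
        return [("PickUp", "Mug"), ("Place", "CoffeeMachine")]
    return []


def shortest_path(grid, start, goal):
    """Breadth-first search over the free cells (value 0) of a 2D occupancy grid."""
    queue = deque([start])
    parent = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal is unreachable


def generate_actions(subgoals, semantic_map, grid, start):
    """Expand each symbolic sub-goal into navigation steps plus one interaction."""
    actions = []
    pos = start
    for verb, obj in subgoals:
        # Assumption: the map stores the free cell from which the object is reachable.
        target = semantic_map[obj]
        path = shortest_path(grid, pos, target)
        if path is None:
            raise ValueError(f"no path to {obj}")
        actions.extend(("MoveTo", cell) for cell in path[1:])
        actions.append((verb, obj))
        pos = target
    return actions


# Toy scene: a 3x3 room with one obstacle in the middle.
grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
semantic_map = {"Mug": (0, 2), "CoffeeMachine": (2, 2)}
subgoals = plan_subgoals("Please make me a coffee.")
actions = generate_actions(subgoals, semantic_map, grid, start=(0, 0))
```

Keeping the sub-goals and the map symbolic means each stage can be inspected or replaced on its own, which is the modularity and interpretability argument the abstract makes.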