This letter introduces ERRA, an embodied learning architecture that enables robots to jointly acquire three fundamental capabilities (reasoning, planning, and interaction) for solving long-horizon language-conditioned manipulation tasks. ERRA is based on tightly coupled probabilistic inferences at two granularity levels. Coarse-resolution inference is formulated as sequence generation through a large language model, which infers action language from the natural language instruction and the environment state. The robot then zooms in to fine-resolution inference to perform the concrete action corresponding to the action language. Fine-resolution inference is formulated as a Markov decision process, which takes the action language and environmental sensing as observations and outputs the action. The results of action execution in the environment provide feedback for subsequent coarse-resolution reasoning. Such coarse-to-fine inference allows the robot to decompose and accomplish long-horizon tasks interactively. In extensive experiments, we show that ERRA can complete various long-horizon manipulation tasks specified by abstract language instructions. We also demonstrate successful generalization to novel but similar natural language instructions.
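The coarse-to-fine loop described above can be illustrated with a minimal Python sketch. It is not ERRA's implementation: the toy drawer environment, the rule-based `coarse_inference` (standing in for the large language model), the string-parsing `fine_policy` (standing in for the learned MDP policy), and every name below are hypothetical and chosen only to show how execution feedback flows back into the next round of coarse-resolution reasoning.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Action = Tuple[str, ...]  # hypothetical low-level action encoding


@dataclass
class ToyDrawerEnv:
    """Hypothetical tabletop scene: objects on a table and one drawer."""
    on_table: List[str]
    in_drawer: List[str] = field(default_factory=list)
    drawer_open: bool = False

    def observe(self) -> Dict[str, object]:
        # Return copies so callers cannot mutate the environment state.
        return {
            "on_table": list(self.on_table),
            "in_drawer": list(self.in_drawer),
            "drawer_open": self.drawer_open,
        }

    def step(self, action: Action) -> bool:
        """Execute a low-level action and report whether it succeeded."""
        if action == ("open",):
            self.drawer_open = True
            return True
        if action == ("close",):
            self.drawer_open = False
            return True
        if len(action) == 2 and action[0] == "place":
            obj = action[1]
            if self.drawer_open and obj in self.on_table:
                self.on_table.remove(obj)
                self.in_drawer.append(obj)
                return True
        return False  # unknown or infeasible action; state is unchanged


def coarse_inference(instruction: str, obs: Dict[str, object]) -> Optional[str]:
    """Rule-based stand-in for the language model (assumption, not ERRA's model).

    Maps the instruction and the current state to the next action-language
    step, or None once the task is complete.
    """
    if "drawer" not in instruction:
        return None  # this toy planner only understands drawer tasks
    if obs["on_table"] and not obs["drawer_open"]:
        return "open the drawer"
    if obs["on_table"]:
        return f"put the {obs['on_table'][0]} in the drawer"
    if obs["drawer_open"]:
        return "close the drawer"
    return None


def fine_policy(action_language: str, obs: Dict[str, object]) -> Action:
    """Stand-in for the fine-resolution policy (assumption, not a learned policy).

    A learned policy would also condition on `obs`; this toy only parses text.
    """
    prefix, suffix = "put the ", " in the drawer"
    if action_language == "open the drawer":
        return ("open",)
    if action_language == "close the drawer":
        return ("close",)
    if action_language.startswith(prefix) and action_language.endswith(suffix):
        return ("place", action_language[len(prefix):-len(suffix)])
    return ("noop",)


def run_episode(instruction: str, env: ToyDrawerEnv,
                max_steps: int = 10) -> List[Tuple[str, bool]]:
    """Alternate coarse and fine inference until the planner stops."""
    history: List[Tuple[str, bool]] = []
    for _ in range(max_steps):
        # Coarse step: re-plan from the latest state, which reflects feedback.
        action_language = coarse_inference(instruction, env.observe())
        if action_language is None:
            break
        # Fine step: ground the action language in a concrete action.
        action = fine_policy(action_language, env.observe())
        history.append((action_language, env.step(action)))
    return history
```

Calling `run_episode("put everything in the drawer", ToyDrawerEnv(on_table=["apple", "cup"]))` in this toy setting yields four action-language steps: open the drawer, place each object, then close the drawer. Because the planner is queried again after every executed action, a failed step leaves the state unchanged and is simply proposed again, which is the interactive decomposition the abstract describes in miniature.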