Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, Andy Zeng

From arXiv. See website at https://say-can.github.io/. V1: Initial upload. V2: Added PaLM results; added a study of new capabilities (drawer manipulation, chain-of-thought prompting, multilingual instructions); added an ablation study of language model size; added an open-source version of SayCan on a simulated tabletop environment; improved readability.

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.
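The grounding idea in the abstract can be illustrated with a minimal sketch. It assumes that each candidate skill receives a language-model score (how useful the skill is for the instruction) and a value-function score (how likely the skill is to succeed in the current state), and that the two are combined by a simple product. The abstract does not spell out the combination rule, so the product, the function name `select_skill`, the skill names, and all numbers below are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict


def select_skill(lm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Pick the skill with the highest combined score.

    lm_scores:   hypothetical language-model relevance of each skill to the instruction.
    affordances: hypothetical value-function estimate that the skill can succeed
                 from the robot's current state (0.0 means infeasible right now).
    Assumption: the two scores are combined by multiplication.
    """
    combined = {
        skill: lm_scores[skill] * affordances.get(skill, 0.0)
        for skill in lm_scores
    }
    return max(combined, key=combined.get)


# Made-up scores for the instruction "clean up a spill".
lm_scores = {"find a vacuum": 0.5, "find a sponge": 0.4, "go to the table": 0.1}
# Assume no vacuum is available in this environment, so its value is 0.
affordances = {"find a vacuum": 0.0, "find a sponge": 0.6, "go to the table": 0.9}

best = select_skill(lm_scores, affordances)
print(best)  # prints "find a sponge"
```

With these made-up numbers, the language model alone would prefer "find a vacuum", which is a reasonable answer in general but infeasible for this robot. Weighting by the affordance score shifts the choice to a skill that is both relevant and feasible.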
