For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g., "can you get me the pink stuffed whale?", to their sensory observations and actions. This poses a notably difficult challenge for robots: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data from interactions with stuffed whales before. Fortunately, static data on the internet contains vast semantic information, and this information is captured by pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), that leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities for specifying the object of interest, such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation. The project's website and evaluation videos can be found at https://robot-moo.github.io/
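
To make the interface concrete, below is a minimal Python sketch of this kind of pipeline, assuming the open-source OWL-ViT checkpoint available through HuggingFace `transformers` as the vision-language detector; the `policy` call and the box-mask conditioning are hypothetical stand-ins for the learned manipulation policy and the paper's object representation (which, as the abstract describes, conditions the policy on object-identifying information extracted by the VLM), not the authors' implementation.

```python
# Minimal sketch: use an open-vocabulary vision-language detector (OWL-ViT)
# to ground an object phrase from the instruction, then condition a policy
# on the image, the instruction, and the extracted object location.
# Assumes a recent version of HuggingFace `transformers`.
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def locate_object(image: Image.Image, object_phrase: str):
    """Query the VLM with the object phrase and return the highest-scoring
    bounding box [x_min, y_min, x_max, y_max], or None if nothing matches."""
    inputs = processor(text=[[object_phrase]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=target_sizes
    )[0]
    if len(detections["scores"]) == 0:
        return None
    best = detections["scores"].argmax()
    return detections["boxes"][best].tolist()

def object_mask(image: Image.Image, box) -> np.ndarray:
    """Render the detection as a single-channel spatial map to append to the
    RGB observation; a box mask is a simplified stand-in for the coarse
    object marker the policy is conditioned on."""
    mask = np.zeros((image.height, image.width), dtype=np.float32)
    x0, y0, x1, y1 = map(int, box)
    mask[y0:y1, x0:x1] = 1.0
    return mask

# Hypothetical control loop; `policy` stands in for the learned policy.
# image = Image.open("observation.png")
# box = locate_object(image, "pink stuffed whale")
# obs = np.dstack([np.asarray(image) / 255.0, object_mask(image, box)])
# action = policy(obs, instruction="pick up the pink stuffed whale")
```

A design point worth noting in this factorization: the open-vocabulary knowledge lives in the frozen detector rather than in the policy, so the policy only has to learn to act on a spatial object cue, which is what allows generalization to object categories the robot has never manipulated first-hand.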