Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
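To make the interleaving concrete, the sketch below (not the authors' implementation) shows how continuous sensor encodings can be projected into the same embedding space as language tokens to form a multi-modal sentence; the dimensions, vocabulary size, helper names (`embed_tokens`, `encode_image`, `encode_state`), and random projections are all illustrative assumptions.

```python
import numpy as np

# A minimal sketch of PaLM-E-style multi-modal sentences: continuous
# observations are projected into the LLM's token embedding space and
# interleaved with ordinary text token embeddings. All shapes, the toy
# vocabulary, and the helper names here are illustrative assumptions.

D_MODEL = 512          # assumed LLM embedding width
VOCAB = 1000           # assumed toy vocabulary size
rng = np.random.default_rng(0)

EMBED_TABLE = rng.normal(scale=0.02, size=(VOCAB, D_MODEL))  # stand-in token embedding table
W_IMG = rng.normal(scale=0.02, size=(2048, D_MODEL))         # learned ViT-feature -> LLM-space projection
W_STATE = rng.normal(scale=0.02, size=(7, D_MODEL))          # learned state -> LLM-space projection

def embed_tokens(token_ids):
    """Look up embeddings for text tokens (stand-in for the LLM's table)."""
    return EMBED_TABLE[np.asarray(token_ids)]

def encode_image(vit_features):
    """Map ViT patch features (n_patches, 2048) to n_patches LLM-space vectors."""
    return vit_features @ W_IMG

def encode_state(state):
    """Map a continuous robot state estimate (7,) to one LLM-space vector."""
    return (state @ W_STATE)[None, :]

# Interleave encodings in prompt order, e.g.
# "Given <img> and gripper pose <state>, what should the robot do next?"
sentence = np.concatenate([
    embed_tokens([11, 42, 7]),                  # "Given"
    encode_image(rng.normal(size=(16, 2048))),  # <img>: 16 image tokens
    embed_tokens([9, 3, 8]),                    # "and gripper pose"
    encode_state(rng.normal(size=7)),           # <state>: 1 state token
    embed_tokens([5, 6, 2]),                    # "what should the robot do next?"
], axis=0)

print(sentence.shape)  # (seq_len, D_MODEL), fed to the pre-trained LLM decoder
```

In the full model, the encoder projections are trained end-to-end together with the pre-trained language model, as described in the abstract above.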