Large Language Models (LLMs) handle physical commonsense information inadequately. As a result of being trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities helps models generalize and learn physical commonsense reasoning better.