In recent years, vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks (e.g., attributes, location). While such tasks measure the requisite knowledge to ground and reason over a given visual instance, they do not measure the ability of VLMs to retain and generalize this knowledge. In this work, we evaluate their ability to acquire "visible" physical knowledge -- the information that is easily accessible from images of static scenes, particularly along the dimensions of object color, size, and space. We build an automatic pipeline to derive a comprehensive knowledge resource for calibrating and probing these models. Our results indicate a severe gap between model and human performance across all three tasks. Furthermore, our caption-pretrained baseline (CapBERT) significantly outperforms VLMs on both the size and spatial tasks -- highlighting that despite sufficient access to ground language with the visual modality, VLMs struggle to retain such knowledge. The dataset and code are available at https://github.com/Axe--/ViPhy.