Our commonsense knowledge about objects includes their typical visual attributes; we know that bananas are typically yellow or green, and not purple. Text and image corpora, being subject to reporting bias, represent this world-knowledge to varying degrees of faithfulness. In this paper, we investigate to what degree unimodal (language-only) and multimodal (image and language) models capture a broad range of visually salient attributes. To that end, we create the Visual Commonsense Tests (ViComTe) dataset covering 5 property types (color, shape, material, size, and visual co-occurrence) for over 5000 subjects. We validate this dataset by showing that our grounded color data correlates much better than ungrounded text-only data with crowdsourced color judgments provided by Paik et al. (2021). We then use our dataset to evaluate pretrained unimodal models and multimodal models. Our results indicate that multimodal models better reconstruct attribute distributions, but are still subject to reporting bias. Moreover, increasing model size does not enhance performance, suggesting that the key to visual commonsense lies in the data.