We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly; crucially, both captions contain exactly the same set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
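The matching task described above can be made concrete with a short sketch. The abstract does not spell out a scoring rule, so the strict-inequality criterion below is an assumption: a model passes an example only if each caption scores strictly higher with its own image than with the other one, and vice versa. The names `Scorer`, `same_words`, and `matches_correctly` are hypothetical, and the scorer stands in for any vision and language model that returns a caption-image similarity.

```python
from typing import Callable, Dict, Tuple

# Hypothetical scorer type: takes (caption, image) and returns a similarity,
# where a higher value means the caption fits the image better.
# Images are represented as opaque identifiers to keep the sketch self-contained.
Scorer = Callable[[str, str], float]


def same_words(caption_a: str, caption_b: str) -> bool:
    """Check that two captions use an identical multiset of words."""
    return sorted(caption_a.lower().split()) == sorted(caption_b.lower().split())


def matches_correctly(
    score: Scorer,
    captions: Tuple[str, str],
    images: Tuple[str, str],
) -> Dict[str, bool]:
    """Evaluate one example; captions[k] is assumed to belong to images[k]."""
    c0, c1 = captions
    i0, i1 = images
    s00, s01 = score(c0, i0), score(c0, i1)
    s10, s11 = score(c1, i0), score(c1, i1)

    # For each image, the correct caption must beat the wrong caption.
    text_ok = s00 > s10 and s11 > s01
    # For each caption, the correct image must beat the wrong image.
    image_ok = s00 > s01 and s11 > s10
    return {"text": text_ok, "image": image_ok, "both": text_ok and image_ok}


if __name__ == "__main__":
    # Illustrative captions and scores, not taken from the dataset.
    caps = ("a dog chases a cat", "a cat chases a dog")
    imgs = ("image_0", "image_1")
    toy = {
        (caps[0], imgs[0]): 0.9, (caps[0], imgs[1]): 0.3,
        (caps[1], imgs[0]): 0.2, (caps[1], imgs[1]): 0.8,
    }
    print(same_words(*caps))
    print(matches_correctly(lambda c, i: toy[(c, i)], caps, imgs))
```

Under this assumed criterion, a scorer that ignores word order assigns both captions the same score for a given image, so the strict inequalities fail and the example counts as missed. That is the behavior the task is designed to expose.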