Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning. Yet, they fail miserably on the recently proposed Winoground dataset, which challenges models to match paired images and English captions, with items constructed to overlap lexically but differ in meaning (e.g., "there is a mug in some grass" vs. "there is some grass in a mug"). By annotating the dataset with new fine-grained tags, we show that solving the Winoground task requires not just compositional language understanding, but a host of other abilities, such as commonsense reasoning and locating small, out-of-focus objects in low-resolution images. In this paper, we identify the dataset's main challenges through a suite of experiments on related tasks (probing task, image retrieval task), data augmentation, and manual inspection of the dataset. Our analysis suggests that a main challenge for visuolinguistic models may lie in fusing visual and textual representations, rather than in compositional language understanding. We release our annotations and code at https://github.com/ajd12342/why-winoground-hard .