Image-Text Matching (ITM) is a common task for evaluating the quality of Vision and Language (VL) models. However, existing ITM benchmarks have a significant limitation: they contain many missing correspondences, originating from the data construction process itself. For example, a caption is matched with only one image even though it could equally describe other similar images, and vice versa. To correct the massive false negatives, we construct the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators. We employ five state-of-the-art ITM models with diverse properties for our annotation process. Our dataset provides ×3.6 positive image-to-caption associations and ×8.5 caption-to-image associations compared to the original MS-COCO. We also propose to use an informative ranking-based metric, mAP@R, rather than the popular Recall@K (R@K). We re-evaluate 25 existing VL models on the existing and proposed benchmarks. Our findings are that the existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, and CxC R@1, are highly correlated with each other, while the rankings change when we shift to ECCV mAP@R. Lastly, we delve into the effect of the bias introduced by the choice of machine annotator. Source code and dataset are available at https://github.com/naver-ai/eccv-caption
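For reference, mAP@R scores a ranked retrieval list against all R known positives for each query: precision is accumulated only at the ranks (within the top R) where a positive appears, then averaged over queries. Below is a minimal sketch of this metric in plain NumPy; the function name and input format are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def map_at_r(ranked_relevance):
    """mAP@R over a set of queries.

    ranked_relevance: list of 1-D binary arrays; entry q gives the
    relevance (1 = positive) of the full ranked retrieval list for
    query q, and its total number of ones is R_q, the number of
    ground-truth positives for that query.
    """
    aps = []
    for rel in ranked_relevance:
        rel = np.asarray(rel)
        r = int(rel.sum())      # R = number of known positives
        if r == 0:
            continue            # skip queries with no positives
        top = rel[:r]           # only the top-R ranks are scored
        hits = np.cumsum(top)
        # precision at rank i, counted only where the item is relevant
        prec = hits / (np.arange(r) + 1)
        aps.append((prec * top).sum() / r)
    return float(np.mean(aps))

# toy example: a query with R = 3 positives, hit at ranks 1 and 3
print(map_at_r([[1, 0, 1, 0, 1]]))  # (1/1 + 2/3) / 3 ≈ 0.556
```

Unlike R@K, which only checks whether any positive appears in the top K, this formulation rewards retrieving all R positives early, which matters once each query has many valid matches, as in ECCV Caption.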