The CLIP model has recently proven to be highly effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S), which unifies, in a novel way, the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics such as CIDEr and SPICE and reference-free metrics such as CLIP-Score. Finally, we test the system-level correlation of the proposed metric on popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
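To make the notion of a reference-free, CLIP-based captioning metric concrete, below is a minimal sketch of the CLIP-Score-style computation that such metrics build on: the cosine similarity between the image and caption embeddings, rescaled and floored at zero. The backbone name ("ViT-B/32") and the rescaling weight w are illustrative assumptions for this example, not the exact PAC-S configuration; PAC-S differs in that its visual-semantic space is trained with positive-augmented contrastive learning on curated and generated data.

```python
# Illustrative sketch (not the authors' exact implementation) of a
# reference-free CLIP-based captioning score: cosine similarity between
# image and caption embeddings, rescaled by w and clipped at zero.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed backbone

def clip_style_score(image_path: str, caption: str, w: float = 2.5) -> float:
    # Encode the image and the candidate caption with the same dual encoder.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    # L2-normalize so the dot product equals the cosine similarity.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    cos = (img_feat @ txt_feat.T).item()
    # Rescale and floor at zero, as in the CLIP-Score formulation.
    return w * max(cos, 0.0)
```

A higher score indicates a caption that is better aligned with the image in the learned embedding space; no reference captions are required, which is what distinguishes this family of metrics from reference-based ones such as CIDEr and SPICE.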