Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe only the most salient common objects, models trained with text similarity objectives tend to ignore the specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on a large set of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations. This completely eliminates the need for reference captions during reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained criteria: overall, background, object, and relations. In our experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model. We also show that our unsupervised grammar finetuning of the CLIP text encoder alleviates the degeneration problem of the naive CLIP reward. Lastly, we present a human analysis in which annotators strongly prefer the CLIP reward over the CIDEr and MLE objectives across various criteria. Code and data: https://github.com/j-min/CLIP-Caption-Reward
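As a concrete illustration of the reference-free CLIP reward described above, the sketch below scores candidate captions by the cosine similarity between CLIP image and text embeddings. This is a minimal sketch assuming the Hugging Face `transformers` CLIP implementation; the model checkpoint and the helper name `clip_reward` are illustrative, and the paper's actual setup (CLIP variant, reward baseline, and the grammar-finetuned text encoder) differs in detail.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper's CLIP variant may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Return one reward per candidate caption: the cosine similarity
    between the CLIP image embedding and each caption embedding.
    No reference captions are needed."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # shape: (num_captions,)

# Usage sketch (hypothetical variable names): in RL-style caption training,
# sampled captions are rewarded relative to a greedy-decoded baseline, e.g.
# advantage = clip_reward(img, sampled_captions) - clip_reward(img, [greedy_caption])
```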