Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe only the most salient common objects, models trained with text similarity objectives tend to ignore the specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations. This completely eliminates the need for reference captions during the reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained criteria: overall, background, object, and relations. In our experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model. We also show that our unsupervised grammar finetuning of the CLIP text encoder alleviates the degeneration problem of the naive CLIP reward. Lastly, we present a human analysis in which annotators strongly prefer the CLIP reward over the CIDEr and MLE objectives across various criteria. Code and Data: https://github.com/j-min/CLIP-Caption-Reward
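To make the reward concrete, the sketch below shows one way a reference-free CLIP reward can be computed: the reward for a sampled caption is the cosine similarity between the CLIP image embedding and the CLIP text embedding, so no reference captions are involved. This is a minimal illustration using HuggingFace's `transformers` CLIP implementation and the `openai/clip-vit-base-patch32` checkpoint as assumptions; it is not the paper's exact implementation, which also uses the grammar-finetuned text encoder described above.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP checkpoint (the paper's exact backbone may differ).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_reward(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Return the cosine similarity between the image and each candidate caption."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # One similarity score per sampled caption; no reference captions are needed.
    return (text_emb @ image_emb.T).squeeze(-1)

# Example: score two sampled captions for one image (hypothetical file name).
# rewards = clip_reward(Image.open("example.jpg"), ["a dog running on a beach", "a dog"])
```

In a self-critical training loop, these similarity scores would play the role that CIDEr plays in standard reward optimization, with the grammar-finetuned text encoder used to keep the generated captions fluent.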