Image captioning models are usually trained on human-annotated ground-truth captions, which often leads to accurate but generic captions. In this paper, we focus on generating distinctive captions that distinguish the target image from other similar images. To evaluate the distinctiveness of captions, we introduce a series of metrics that use the large-scale vision-language pre-trained model CLIP to quantify distinctiveness. To further improve the distinctiveness of captioning models, we propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap. Extensive experiments on various baseline models demonstrate the wide applicability of our strategy and the consistency of the metric results with human evaluation. By comparing the performance of our best model with existing state-of-the-art models, we show that our model achieves a new state of the art with respect to the distinctiveness objective.
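To make the idea of CLIP-based distinctiveness concrete, the following is a minimal illustrative sketch, not the paper's exact metric: it scores a caption by its CLIP similarity to the target image minus its mean similarity to a group of visually similar images. The function name `distinctiveness_score`, the checkpoint choice, and the similarity gap used here are assumptions for illustration only.

```python
# Minimal sketch (assumed formulation, not the paper's exact metric):
# quantify how well a caption distinguishes the target image from similar images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def distinctiveness_score(caption: str, target_img: Image.Image,
                          similar_imgs: list) -> float:
    """CLIP similarity of the caption to the target image minus its mean
    similarity to the group of similar images (higher = more distinctive)."""
    images = [target_img] + list(similar_imgs)
    inputs = processor(text=[caption], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize and take cosine similarity between the caption and each image.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(-1)
    # Gap between the target image and the similar-image group.
    return (sims[0] - sims[1:].mean()).item()
```

A generic caption tends to score near zero under such a gap, since it matches the similar images about as well as the target, whereas a caption mentioning details unique to the target yields a larger positive score.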