Existing multi-style image captioning methods show promising results in generating captions with accurate visual content and the desired linguistic style. However, these methods overlook the relationship between linguistic style and visual content. To overcome this drawback, we propose style-aware contrastive learning for multi-style image captioning. First, we present a style-aware visual encoder trained with contrastive learning to mine potential visual content relevant to style. Moreover, we propose a style-aware triplet contrast objective that distinguishes whether an image, style, and caption are matched. To provide positive and negative samples for contrastive learning, we present three retrieval schemes: object-based retrieval, RoI-based retrieval, and triplet-based retrieval, and we design a dynamic trade-off function to compute retrieval scores. Experimental results demonstrate that our approach achieves state-of-the-art performance. In addition, we conduct an extensive analysis to verify the effectiveness of our method.
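To make the triplet contrast objective concrete, the sketch below shows one plausible InfoNCE-style formulation: a fused image-style embedding is pulled toward its matched caption and pushed away from retrieved mismatched captions. This is a minimal illustration under assumed design choices (cosine similarity, a temperature of 0.1, randomly generated toy embeddings), not the paper's actual model or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_contrastive_loss(img_style, caption_pos, captions_neg, tau=0.1):
    """InfoNCE-style loss: one matched caption vs. retrieved mismatched ones.

    `img_style` stands in for a fused image+style embedding (an assumption;
    the paper's fusion scheme may differ).
    """
    logits = np.array(
        [cosine(img_style, caption_pos)]
        + [cosine(img_style, c) for c in captions_neg]
    ) / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # matched triplet is index 0

# Toy example: the matched caption embedding lies close to the anchor,
# while negatives (e.g. from the retrieval schemes) are unrelated.
anchor = rng.normal(size=8)
pos = anchor + 0.05 * rng.normal(size=8)         # matched caption
negs = [rng.normal(size=8) for _ in range(4)]    # mismatched captions
matched_loss = triplet_contrastive_loss(anchor, pos, negs)
mismatched_loss = triplet_contrastive_loss(anchor, negs[0], [pos] + negs[1:])
```

As expected for a contrastive objective, the loss is low when the triplet is matched and high when the caption does not fit the image-style pair, which is exactly the matched/mismatched distinction the objective is meant to learn.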