Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent line of DIC work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to make the generated captions tell apart the target image from the reference images. Unfortunately, the reference images used in existing Ref-DIC works are easy to distinguish: they resemble the target image only at the scene level and share few common objects, so a Ref-DIC model can trivially generate distinctive captions even without considering the reference images at all. To ensure that Ref-DIC models truly perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism that strictly controls the similarity between the target and reference images at the object/attribute level (vs. the scene level). Second, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric for Ref-DIC, named DisCIDEr, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. Moreover, it outperforms several state-of-the-art models on the two new benchmarks across different metrics.
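The two-stage matching mechanism is only described at a high level above; the following is a minimal Python sketch of the idea, assuming precomputed object/attribute label sets per image. The function name `two_stage_match`, the threshold `min_shared_objects`, and the input structure are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List, Set

def two_stage_match(
    target: str,
    candidates: List[str],
    objects: Dict[str, Set[str]],      # image id -> detected object labels
    attributes: Dict[str, Set[str]],   # image id -> detected attribute labels
    min_shared_objects: int = 3,
) -> List[str]:
    """Pick reference images that are hard to tell apart from the target.

    Stage 1 (object level): keep candidates that share many objects with
    the target, so references cannot be ruled out by scene content alone.
    Stage 2 (attribute level): keep candidates for which the target still
    has at least one attribute they lack, so a caption must mention the
    target's unique attributes in order to be distinctive.
    """
    tgt_objs, tgt_attrs = objects[target], attributes[target]

    # Stage 1: object-level overlap with the target.
    stage1 = [c for c in candidates
              if len(tgt_objs & objects[c]) >= min_shared_objects]

    # Stage 2: the target must keep a unique attribute w.r.t. each reference.
    return [c for c in stage1 if tgt_attrs - attributes[c]]


# Toy usage: "img_b" shares all objects with the target but lacks the
# attribute "red", so it is a hard reference; "img_c" shares no objects.
objects = {
    "img_a": {"dog", "frisbee", "grass"},
    "img_b": {"dog", "frisbee", "grass"},
    "img_c": {"cat", "sofa", "lamp"},
}
attributes = {
    "img_a": {"red", "running"},
    "img_b": {"running"},
    "img_c": {"sleeping"},
}
print(two_stage_match("img_a", ["img_b", "img_c"], objects, attributes))
# -> ['img_b']
```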
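Similarly, the abstract states only that DisCIDEr scores both accuracy and distinctiveness. The sketch below shows one plausible way such a metric could combine the two, using clipped unigram precision as a stand-in for CIDEr's TF-IDF n-gram similarity; the function name `discider_like` and the accuracy-times-distinctiveness combination are assumptions for illustration, and the paper's exact formulation may differ.

```python
from collections import Counter
from typing import List

def ngrams(caption: str, n: int = 1) -> Counter:
    """Multiset of n-grams in a lowercased, whitespace-tokenized caption."""
    toks = caption.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def overlap(a: Counter, b: Counter) -> float:
    """Clipped n-gram precision of a against b, in [0, 1]."""
    if not a:
        return 0.0
    return sum(min(c, b[g]) for g, c in a.items()) / sum(a.values())

def discider_like(candidate: str,
                  target_captions: List[str],
                  reference_captions: List[str]) -> float:
    """Score = accuracy * distinctiveness, both in [0, 1].

    accuracy: how well the candidate matches the target image's own
    ground-truth captions (a crude stand-in for CIDEr here).
    distinctiveness: 1 minus the candidate's overlap with the reference
    images' captions, rewarding content unique to the target.
    """
    cand = ngrams(candidate)
    accuracy = max(overlap(cand, ngrams(t)) for t in target_captions)
    leakage = max(overlap(cand, ngrams(r)) for r in reference_captions)
    return accuracy * (1.0 - leakage)


# Toy usage: only the first candidate mentions the target-unique word "red".
tgt = ["a red dog catches a frisbee"]
ref = ["a dog catches a frisbee"]
print(discider_like("a red dog catches a frisbee", tgt, ref))  # ~0.17
print(discider_like("a dog catches a frisbee", tgt, ref))      # 0.0
```

Note that the multiplicative combination zeroes out captions that are accurate but contain nothing unique to the target, which matches the abstract's requirement that a good Ref-DIC caption be both accurate and distinctive.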