Change captioning aims to describe, in natural language, the semantic change between a pair of similar images. It is more challenging than general image captioning because it requires capturing fine-grained change information while remaining immune to irrelevant viewpoint changes, and resolving syntactic ambiguity in change descriptions. In this paper, we propose a neighborhood contrastive transformer to improve the model's ability to perceive various changes under different scenes and to understand complex syntactic structure. Concretely, we first design a neighboring feature aggregating module to integrate neighboring context into each feature, which helps quickly locate inconspicuous changes under the guidance of conspicuous referents. Then, we devise a common feature distilling module to compare the two images at the neighborhood level and extract common properties from each image, so as to learn effective contrastive information between them. Finally, we introduce explicit dependencies between words to calibrate the transformer decoder, which helps it better understand complex syntactic structure during training. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on three public datasets with different change scenarios. The code is available at https://github.com/tuyunbin/NCT.
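As a rough illustration only (not the authors' actual modules — function names, the 3x3 neighborhood, and the cosine-based matching are all simplifying assumptions for exposition), the two core ideas of neighborhood aggregation and common-feature distilling can be sketched with NumPy:

```python
import numpy as np

def aggregate_neighbors(feat):
    """Neighboring feature aggregating (sketch): enrich each grid cell with
    the mean of its 3x3 neighborhood, so a change at one cell is seen in the
    context of nearby referents. feat: (H, W, C) feature grid."""
    H, W, C = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            out[i, j] = feat[i, j] + padded[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out

def distill_common(feat_a, feat_b):
    """Common feature distilling (sketch): for each cell of feat_a, keep the
    part shared with its best cosine match in feat_b ("common"); the residual
    approximates the change-specific, contrastive signal."""
    a = feat_a.reshape(-1, feat_a.shape[-1])
    b = feat_b.reshape(-1, feat_b.shape[-1])
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T                              # pairwise cosine similarity
    common = sim.max(axis=1, keepdims=True) * a    # weight by best match
    contrastive = a - common                       # residual change signal
    return common.reshape(feat_a.shape), contrastive.reshape(feat_a.shape)
```

When the two inputs are identical, every cell's best match is itself (cosine similarity 1), so the "common" part recovers the full feature and the contrastive residual vanishes; differing regions instead leave a large residual, which is the signal the decoder would attend to.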