Recent advances in language and vision have pushed research forward from captioning a single image to describing the visual differences between a pair of images. Given two images, I_1 and I_2, and the task of generating a description W_{1,2} that compares them, existing methods directly model the {I_1, I_2} -> W_{1,2} mapping without a semantic understanding of each individual image. In this paper, we introduce a Learning-to-Compare (L2C) model that learns to understand the semantic structures of the two images and compare them while also learning to describe each one. We demonstrate that L2C benefits from comparing explicit semantic representations and from single-image captions, and that it generalizes better to unseen test image pairs. It outperforms the baseline on both automatic and human evaluation on the Birds-to-Words dataset.
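For concreteness, one way to read the joint formulation described above is as a multi-task objective that couples comparative description with auxiliary single-image captioning. The equation below is an illustrative sketch under that assumption; the specific loss terms, the captions W_1 and W_2, and the weighting coefficient \(\lambda\) are not taken from the abstract.
\[
\mathcal{L}(\theta) \;=\; \underbrace{-\log p_\theta\!\left(W_{1,2} \mid I_1, I_2\right)}_{\text{comparative description}} \;+\; \lambda \Big[ -\log p_\theta\!\left(W_1 \mid I_1\right) \;-\; \log p_\theta\!\left(W_2 \mid I_2\right) \Big],
\]
where \(W_1\) and \(W_2\) denote captions of the individual images and \(\lambda\) balances the auxiliary captioning task against the comparison task.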