The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images in natural language. The task poses two major challenges: 1) the differences are fine-grained, requiring a stronger association between vision and language, and 2) manual annotation is costly, so supervised data is limited. To address these challenges, we propose a new modeling framework that follows the pre-training and fine-tuning paradigm. Specifically, we design three self-supervised tasks and contrastive learning strategies to align visual differences and text descriptions at a fine-grained level. Moreover, we propose a data expansion strategy that exploits extra cross-task supervision, such as fine-grained image classification data, to alleviate the scarcity of supervised IDC data. Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed framework. The code and models will be released at https://github.com/yaolinli/IDC.
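To make the contrastive alignment idea concrete, the following is a minimal InfoNCE-style sketch: difference embeddings and text embeddings from matched pairs are pulled together while other pairs in the batch act as negatives. All function names and parameters here are illustrative assumptions, not the paper's actual objectives.

```python
import numpy as np

def info_nce_loss(diff_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    diff_emb, text_emb: (batch, dim) arrays; row i of each forms a positive
    pair, and every other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity
    d = diff_emb / np.linalg.norm(diff_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = d @ t.T / temperature      # (batch, batch) similarity matrix
    idx = np.arange(len(logits))        # positives lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()         # cross-entropy w.r.t. diagonal

    # average over both retrieval directions (diff->text and text->diff)
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
batch, dim = 8, 32
diff = rng.normal(size=(batch, dim))
# perfectly aligned pairs should yield a much lower loss than random pairs
aligned_loss = info_nce_loss(diff, diff)
random_loss = info_nce_loss(diff, rng.normal(size=(batch, dim)))
```

In a real pre-training setup, `diff_emb` would come from a vision encoder applied to the image pair and `text_emb` from a language encoder over the difference caption; the batch-negative structure is what drives the fine-grained vision–language association.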