Image Difference Captioning (IDC) aims at generating sentences that describe the differences between two similar-looking images. Conventional approaches learn an IDC model on top of a pre-trained and usually frozen visual feature extractor. Accordingly, two major issues may arise: (1) a large domain gap usually exists between the datasets used to pre-train such a visual encoder and the dataset of the downstream IDC task, and (2) the visual feature extractor, when encoding the two images separately, often fails to effectively capture the visual changes between them. Motivated by the excellent zero-shot performance of the recently proposed CLIP, we propose CLIP4IDC, which transfers a CLIP model to the IDC task to address these issues. Rather than directly fine-tuning CLIP to generate sentences, we introduce an adaptation training stage that adapts CLIP's visual encoder to capture and align the differences within image pairs based on their textual descriptions. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC.
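To make the adaptation stage concrete, the sketch below shows one plausible way such retrieval-style training could be set up: the embeddings of the "before" and "after" images are fused into a single pair representation that is aligned with the embedding of the difference caption through a symmetric contrastive loss. This is a minimal illustration under our own assumptions, not the authors' released code; `vision_encoder`, `text_encoder`, the linear fusion, and the loss formulation are all placeholders standing in for CLIP's components and the paper's actual design.

```python
# Hypothetical sketch of a CLIP adaptation step for image pairs.
# Assumptions: vision_encoder/text_encoder return (B, D) embeddings,
# fusion and the symmetric InfoNCE loss are illustrative choices only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairAdapter(nn.Module):
    def __init__(self, vision_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. CLIP's image encoder (assumed interface)
        self.text_encoder = text_encoder       # e.g. CLIP's text encoder (assumed interface)
        # Fuse the two image embeddings into one "difference" embedding.
        self.fusion = nn.Linear(2 * embed_dim, embed_dim)
        # Learnable temperature, initialised near CLIP's log(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_before, img_after, caption_tokens):
        v1 = self.vision_encoder(img_before)                 # (B, D)
        v2 = self.vision_encoder(img_after)                  # (B, D)
        pair = self.fusion(torch.cat([v1, v2], dim=-1))      # (B, D) pair embedding
        txt = self.text_encoder(caption_tokens)              # (B, D) caption embedding

        pair = F.normalize(pair, dim=-1)
        txt = F.normalize(txt, dim=-1)
        logits = self.logit_scale.exp() * pair @ txt.t()     # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric contrastive loss: pair->caption and caption->pair retrieval.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```

After this alignment stage, the adapted visual encoder would then be used as the feature extractor for the captioning model that generates the difference sentences.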