Image Difference Captioning (IDC) aims to generate sentences describing the differences between two similar-looking images. Conventional approaches learn captioning models on offline-extracted visual features, so the training signal cannot be propagated back to the fixed feature extractors pre-trained on image classification datasets. Accordingly, fine-tuning the visual features offers two potential improvements: 1) narrowing the domain gap that arises when a visual extractor trained on image classification is applied to IDC, and 2) relating the extracted visual features to the descriptions of the corresponding changes. We thus propose CLIP4IDC, which transfers a CLIP model to the IDC task to attain these improvements. Rather than directly fine-tuning CLIP to generate sentences, we apply a task-specific domain adaptation to improve the extracted features. Specifically, CLIP is trained on raw pixels to relate image pairs to their described changes. Afterwards, a vanilla Transformer is trained for IDC on the features extracted by the vision encoder of CLIP. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC. Our code and models will be released at https://github.com/sushizixin/CLIP4IDC.
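To make the two-stage recipe concrete, the following PyTorch sketch illustrates the general idea under stated assumptions: stage 1 adapts CLIP with a retrieval-style contrastive loss that matches an image-pair embedding to the embedding of its change description, and stage 2 trains a vanilla Transformer decoder on CLIP visual features. The concatenation-plus-projection pair fusion, the temperature value, and all class and variable names here are illustrative assumptions, not the exact CLIP4IDC architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class PairRetrievalAdapter(nn.Module):
    """Stage 1 (sketch): adapt CLIP so an image-pair embedding is close to the
    embedding of the sentence describing the change between the two images.
    The concat-and-project fusion below is an assumption for illustration."""

    def __init__(self, clip_model, embed_dim=512):
        super().__init__()
        self.clip = clip_model
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, before_images, after_images, text_tokens):
        v_before = self.clip.encode_image(before_images).float()
        v_after = self.clip.encode_image(after_images).float()
        pair = self.fuse(torch.cat([v_before, v_after], dim=-1))
        text = self.clip.encode_text(text_tokens).float()
        pair = F.normalize(pair, dim=-1)
        text = F.normalize(text, dim=-1)
        logits = pair @ text.t() / 0.07  # temperature: a common CLIP-style choice (assumption)
        labels = torch.arange(logits.size(0), device=logits.device)
        # symmetric contrastive loss: pair-to-text and text-to-pair
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))


class DiffCaptioner(nn.Module):
    """Stage 2 (sketch): a vanilla Transformer decoder that generates the
    difference caption conditioned on visual features from CLIP's vision encoder."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, caption_tokens, visual_memory):
        # caption_tokens: (batch, seq_len) token ids; visual_memory: (batch, mem_len, d_model)
        tgt = self.embed(caption_tokens)
        seq_len = tgt.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tgt.device), diagonal=1
        )
        hidden = self.decoder(tgt, visual_memory, tgt_mask=causal_mask)
        return self.out(hidden)  # logits over the caption vocabulary
```

In this sketch the CLIP backbone would be loaded with `clip.load("ViT-B/32")` and text tokenized with `clip.tokenize`; after the stage-1 adaptation, the adapted vision encoder supplies the `visual_memory` features on which the captioner is trained.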