Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. Existing methods rely on supervised learning, and the high effort and cost of labeling CIR datasets hamper their widespread adoption. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), which aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in the CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first CIR dataset containing multiple ground truths for each query. The experiments show that SEARLE outperforms the baselines on the two main CIR datasets, FashionIQ and CIRR, as well as on the proposed CIRCO. The dataset, code, and model are publicly available at https://github.com/miccunifi/SEARLE .
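The textual-inversion retrieval pipeline described above can be sketched as follows. This is a minimal NumPy illustration, not SEARLE's actual implementation: the mapping network `phi` is stood in for by a single random linear layer (SEARLE learns it via optimization and distillation), the feature dimensions, the placeholder-slot mechanism, and the mean-pooled text encoder are all simplifying assumptions made for the sake of a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = TOK_DIM = 512  # CLIP ViT-B/32 feature and token-embedding widths (assumption)

# phi: textual-inversion network mapping a CLIP image feature to a
# pseudo-word token embedding v*. A random linear layer stands in for
# SEARLE's learned mapping.
W_phi = rng.standard_normal((TOK_DIM, FEAT_DIM)) / np.sqrt(FEAT_DIM)


def compose_query(image_feat, caption_tok_embs, slot):
    """Replace the placeholder token at position `slot` of the relative
    caption's token-embedding sequence with v* = phi(image_feat)."""
    v_star = W_phi @ image_feat
    embs = caption_tok_embs.copy()
    embs[slot] = v_star
    return embs


def retrieve(query_feat, gallery_feats):
    """Rank gallery images by cosine similarity to the composed query."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return np.argsort(-(g @ q))  # indices sorted by descending similarity


# Toy example with random tensors in place of real CLIP outputs.
image_feat = rng.standard_normal(FEAT_DIM)          # reference-image feature
caption_embs = rng.standard_normal((8, TOK_DIM))    # 8 caption tokens
composed = compose_query(image_feat, caption_embs, slot=1)
query_feat = composed.mean(axis=0)                  # stand-in for CLIP's text encoder
gallery = rng.standard_normal((100, FEAT_DIM))      # 100 gallery-image features
ranking = retrieve(query_feat, gallery)
```

In words: the reference image is inverted into a single pseudo-word, spliced into a prompt such as "a photo of $ that {relative caption}", and the resulting text feature is compared against the gallery's image features, so retrieval reduces to standard CLIP text-to-image search.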