We extend the task of composed image retrieval, where an input query consists of an image and a short textual description of how to modify the image. Existing methods have only been applied to non-complex images within narrow domains, such as fashion products, thereby limiting the scope of study on in-depth visual reasoning in rich image and language contexts. To address this issue, we collect the Composed Image Retrieval on Real-life images (CIRR) dataset, which consists of over 36,000 pairs of crowd-sourced, open-domain images with human-generated modifying text. To extend current methods to the open domain, we propose CIRPLANT, a transformer-based model that leverages rich pre-trained vision-and-language (V&L) knowledge to modify visual features conditioned on natural language. Retrieval is then performed by nearest-neighbor lookup on the modified features. We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on existing narrow datasets such as fashion. Together with the release of CIRR, we believe this work will inspire further research on composed image retrieval.
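The abstract describes the retrieval step as a nearest-neighbor lookup over the language-modified image features. Below is a minimal sketch of that final step, assuming cosine similarity as the ranking metric; the function and variable names are illustrative and not taken from the paper's released code.

```python
import numpy as np

def retrieve(query_feat, gallery_feats, k=5):
    """Rank candidate images by similarity to a composed query feature.

    query_feat:    (d,) feature of the (image, modifying-text) query after
                   fusion by the retrieval model (e.g., CIRPLANT).
    gallery_feats: (N, d) features of the N candidate images.
    Returns the indices of the top-k nearest neighbors.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                    # cosine similarity to each candidate
    return np.argsort(-sims)[:k]    # indices of the k best matches
```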