In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets for training. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmarks, CIRR and Fashion-IQ. Code will be made publicly available at https://github.com/google-research/composed_image_retrieval.
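To make the ZS-CIR setting concrete, below is a minimal, hypothetical sketch of the inference flow suggested by the abstract: a frozen image encoder embeds the query image, a learned mapping network turns that embedding into a pseudo word token, and the pseudo token is composed with the text specification before ranking candidate images by cosine similarity. The encoders are stand-in linear layers, and names such as `MappingNetwork` and `compose_query` are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of ZS-CIR inference; real usage would rely on frozen CLIP
# encoders (encode_image / encode_text) rather than the stand-in layers below.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # shared embedding dimension (assumption)


class MappingNetwork(nn.Module):
    """Maps a frozen image embedding to a pseudo word-token embedding."""

    def __init__(self, dim: int = D):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(img_emb)


# Stand-ins for frozen encoders; in practice these would be CLIP's encoders.
image_encoder = nn.Linear(2048, D)
text_encoder = nn.Linear(D, D)


def compose_query(query_image_feat, text_spec_emb, mapper):
    """Compose the query image (as a pseudo token) with the text specification."""
    img_emb = image_encoder(query_image_feat)
    pseudo_token = mapper(img_emb)  # image -> "word"
    # The paper inserts the pseudo token into a text prompt; this sketch
    # approximates that composition by encoding the sum of the two embeddings.
    return F.normalize(text_encoder(pseudo_token + text_spec_emb), dim=-1)


def retrieve(query_emb, candidate_embs, k=5):
    """Rank candidate images by cosine similarity to the composed query."""
    sims = query_emb @ F.normalize(candidate_embs, dim=-1).T
    return sims.topk(k, dim=-1).indices


mapper = MappingNetwork()
query_emb = compose_query(torch.randn(1, 2048), torch.randn(1, D), mapper)
candidates = F.normalize(image_encoder(torch.randn(100, 2048)), dim=-1)
print(retrieve(query_emb, candidates))  # indices of top-5 candidates
```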