Composed image retrieval searches for a target image based on a multi-modal user query consisting of a reference image and modification text describing the desired changes. Existing approaches to this challenging task learn a mapping from the (reference image, modification text) pair to an image embedding that is then matched against a large image corpus. One direction that has not yet been explored is the reverse query, which asks: what reference image, when modified as described by the text, would produce the given target image? In this work, we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures. To encode the bi-directional query, we prepend a learnable token to the modification text that designates the direction of the query and then fine-tune the parameters of the text embedding module. We make no other changes to the network architecture. Experiments on two standard datasets show that our approach achieves improved performance over a baseline BLIP-based model that itself already achieves state-of-the-art results.
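To make the direction-token idea concrete, the following is a minimal sketch of how a learnable direction token could be prepended to the modification-text token embeddings before they enter a text encoder. It assumes a generic transformer text encoder; the class name, dimensions, and the stand-in encoder are illustrative and not taken from the paper, which uses a BLIP-based text embedding module.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a query encoder with a learnable direction token
# (index 0 = forward query, index 1 = reversed query) prepended to the
# modification-text embeddings. Only the text side is shown.
class BiDirectionalQueryEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Two learnable direction tokens, one per query direction.
        self.direction_tokens = nn.Parameter(torch.randn(2, embed_dim) * 0.02)

    def forward(self, text_token_embeds: torch.Tensor, reverse: bool) -> torch.Tensor:
        # text_token_embeds: (batch, seq_len, embed_dim) modification-text embeddings.
        batch = text_token_embeds.size(0)
        direction = self.direction_tokens[1 if reverse else 0].expand(batch, 1, -1)
        # Prepend the direction token so the encoder is told the query direction.
        tokens = torch.cat([direction, text_token_embeds], dim=1)
        return self.text_encoder(tokens)


# Usage: encode the same modification text as a forward and a reversed query.
encoder = BiDirectionalQueryEncoder()
text = torch.randn(8, 16, 256)                  # dummy token embeddings
forward_query = encoder(text, reverse=False)    # (reference image + text) -> target
reversed_query = encoder(text, reverse=True)    # (target image + text) -> reference
```

In training, the forward and reversed queries would each be matched against their respective ground-truth images, so the same architecture serves both directions with only the extra token embeddings as new parameters.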