An intuitive way to search for images is to use queries composed of an example image and a complementary text. While the former provides rich and implicit context for the search, the latter explicitly calls for new traits, or specifies how some elements of the example image should be changed to retrieve the desired target image. Current approaches typically combine the features of the two query elements into a single representation, which can then be compared to those of the potential target images. Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval. Taking inspiration from them, we exploit the specific relation of each query element to the targeted image and derive lightweight attention mechanisms that mediate between the two complementary modalities. We validate our approach on several retrieval benchmarks, querying with images and their associated free-form text modifiers. Our method obtains state-of-the-art results without resorting to side information, multi-level features, heavy pre-training, or large architectures as in previous works.
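The abstract describes the common pipeline: fuse the reference-image features and the text-modifier features into a single query representation, then rank candidate targets by similarity to it. A minimal sketch of such a fuse-then-rank step, assuming a simple per-dimension sigmoid gate as the fusion mechanism (the function names, the gating scheme, and the weight matrix `W_gate` are hypothetical illustrations, not the paper's actual attention mechanism):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Unit-normalize embeddings so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def compose_query(img_emb, txt_emb, W_gate):
    """Fuse image and text features into one query embedding.

    A sigmoid gate, computed from both inputs, decides per dimension
    whether to keep the reference-image feature or inject the text
    modifier's feature (a stand-in for the learned fusion module).
    """
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([img_emb, txt_emb]) @ W_gate)))
    return gate * img_emb + (1.0 - gate) * txt_emb

def rank_targets(query_emb, target_embs):
    """Return candidate-target indices sorted by cosine similarity, best first."""
    sims = l2_normalize(target_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)
```

In a real system `img_emb` and `txt_emb` would come from pretrained image and text encoders, and the gate would be replaced by the learned attention mechanisms the abstract refers to; the retrieval step itself stays the same nearest-neighbor search.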