Most existing image retrieval systems use text queries as a way for the user to express what they are looking for. However, fine-grained image retrieval often requires the ability to also express where in the image the desired content is located. The text modality can only cumbersomely express such localization preferences, whereas pointing is a more natural fit. In this paper, we propose an image retrieval setup with a new form of multimodal queries, where the user simultaneously uses both spoken natural language (the what) and mouse traces over an empty canvas (the where) to express the characteristics of the desired target image. We then describe simple modifications to an existing image retrieval model, enabling it to operate in this setup. Qualitative and quantitative experiments show that our model effectively takes this spatial guidance into account, and provides significantly more accurate retrieval results compared to equivalent text-only systems.
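To make the setup concrete, the sketch below illustrates one plausible way such a multimodal query could be scored against an image gallery: the spoken description (the what) is embedded by a text encoder, the mouse trace (the where) is rasterized into a coarse canvas heatmap, and the two are fused into a single query embedding used for nearest-neighbor retrieval. This is not the paper's actual model; every module, dimension, and name here (`MultimodalQueryEncoder`, `trace_to_heatmap`, the grid size, the fusion layer) is an illustrative assumption.

```python
# Hypothetical sketch of multimodal (text + mouse-trace) query retrieval.
# All architecture choices below are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

GRID = 8   # coarse spatial grid the empty canvas is quantized into (assumption)
DIM = 256  # shared embedding dimension (assumption)

def trace_to_heatmap(points, grid=GRID):
    """Rasterize a mouse trace of normalized (x, y) points in [0, 1] into a grid heatmap."""
    heat = torch.zeros(grid, grid)
    for x, y in points:
        i = min(int(y * grid), grid - 1)
        j = min(int(x * grid), grid - 1)
        heat[i, j] += 1.0
    return heat / heat.sum().clamp(min=1.0)

class MultimodalQueryEncoder(nn.Module):
    """Fuses a text embedding (the 'what') with a trace heatmap (the 'where')."""
    def __init__(self, vocab_size=10000, dim=DIM, grid=GRID):
        super().__init__()
        self.text_emb = nn.EmbeddingBag(vocab_size, dim)   # stand-in text encoder
        self.trace_proj = nn.Linear(grid * grid, dim)      # projects the flattened heatmap
        self.fuse = nn.Linear(2 * dim, dim)                # simple late fusion (assumption)

    def forward(self, token_ids, trace_points):
        what = self.text_emb(token_ids.unsqueeze(0))                               # (1, dim)
        where = self.trace_proj(trace_to_heatmap(trace_points).flatten()).unsqueeze(0)
        return F.normalize(self.fuse(torch.cat([what, where], dim=-1)), dim=-1)

# Retrieval: rank a gallery of precomputed image embeddings by cosine similarity.
encoder = MultimodalQueryEncoder()
query = encoder(torch.tensor([12, 845, 77]),                     # toy token ids for the spoken query
                [(0.2, 0.3), (0.25, 0.35), (0.3, 0.4)])          # toy mouse trace on the canvas
gallery = F.normalize(torch.randn(1000, DIM), dim=-1)            # stand-in image embeddings
scores = gallery @ query.squeeze(0)
top5 = scores.topk(5).indices
```

A text-only baseline would drop the `where` branch entirely; the abstract's claim is that adding this spatial signal yields significantly more accurate retrieval.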