Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations have not been demonstrated for image representations. Recent works aiming to bridge this semantic gap embed images and text into a shared multimodal space, enabling the transfer of text-defined transformations to the image modality. We introduce the SIMAT dataset to evaluate the task of Image Retrieval with Multimodal queries. SIMAT contains 6k images and 18k textual transformation queries that aim either to replace scene elements or to change pairwise relationships between scene elements. The goal is to retrieve an image consistent with the (source image, text transformation) query. We use an image/text matching oracle (OSCAR) to assess whether the image transformation is successful. The SIMAT dataset will be made publicly available. We use SIMAT to evaluate the geometric properties of multimodal embedding spaces trained with an image/text matching objective, such as CLIP. We show that vanilla CLIP embeddings are not well suited to transforming images with delta vectors, but that simple finetuning on the COCO dataset brings dramatic improvements. We also study whether it is beneficial to leverage pretrained universal sentence encoders (FastText, LASER and LaBSE).
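The delta-vector idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes image and text embeddings (e.g. from CLIP) are already computed, and the function name, the interpolation weight `lam`, and the toy vectors are all hypothetical.

```python
import numpy as np

def transform_and_retrieve(img_emb, src_text_emb, tgt_text_emb, gallery, lam=1.0):
    """Apply a text-defined delta vector to an image embedding and
    retrieve the nearest gallery image by cosine similarity.

    img_emb: embedding of the source image, shape (d,)
    src_text_emb / tgt_text_emb: embeddings of the source and target
        captions, e.g. "a dog on a table" -> "a cat on a table"
    gallery: candidate image embeddings, shape (n, d)
    lam: strength of the transformation (hypothetical knob)
    """
    delta = tgt_text_emb - src_text_emb      # text-defined transformation
    query = img_emb + lam * delta            # shifted multimodal query
    query = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ query                       # cosine similarity to each candidate
    return int(np.argmax(scores))            # index of the best-matching image
```

With toy 2-D embeddings where the source image aligns with the source caption, the shifted query lands on the gallery image aligned with the target caption, which is what a successful transformation looks like under the SIMAT protocol.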