We introduce the task of open-vocabulary visual instance search (OVIS). Given an arbitrary textual search query, OVIS aims to return a ranked list of visual instances, i.e., image patches, from an image database that satisfy the search intent. The term "open vocabulary" means that there are neither restrictions on the visual instances to be searched nor on the words that can be used to compose the textual search query. We propose to address this search challenge via visual-semantic aligned representation learning (ViSA). ViSA leverages massive image-caption pairs as weak, image-level (not instance-level) supervision to learn a rich cross-modal semantic space in which the representations of visual instances (not whole images) and those of textual queries are aligned, thus allowing us to measure the similarity between any visual instance and an arbitrary textual query. To evaluate the performance of ViSA, we build two datasets, OVIS40 and OVIS1600, and introduce a pipeline for error analysis. Through extensive experiments on the two datasets, we demonstrate ViSA's ability to search for visual instances in images unseen during training, given a wide range of textual queries, including those composed of uncommon words. Experimental results show that ViSA achieves an mAP@50 of 21.9% on OVIS40 under the most challenging setting and an mAP@6 of 14.9% on OVIS1600.
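To make the retrieval step concrete, the sketch below ranks image patches against a textual query by cosine similarity in a shared embedding space, as the abstract describes. The `encode_instances` and `encode_query` functions are hypothetical stand-ins for ViSA's learned encoders (whose architecture is not specified here) and return fabricated random embeddings purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 256  # assumed embedding dimensionality; not specified in the abstract


def encode_instances(image):
    """Stand-in for ViSA's instance encoder: one embedding per candidate
    image patch. Random vectors here, for illustration only."""
    num_patches = 8
    return rng.normal(size=(num_patches, EMB_DIM))


def encode_query(text):
    """Stand-in for ViSA's text encoder mapping a query into the same space."""
    return rng.normal(size=EMB_DIM)


def rank_instances(images, query, top_k=50):
    """Score every patch of every image against the query by cosine
    similarity in the aligned space; return the top-k ranked list."""
    q = encode_query(query)
    q /= np.linalg.norm(q)
    scored = []
    for img_id, image in enumerate(images):
        embs = encode_instances(image)  # (num_patches, EMB_DIM)
        embs /= np.linalg.norm(embs, axis=1, keepdims=True)
        for patch_id, sim in enumerate(embs @ q):  # cosine similarities
            scored.append((float(sim), img_id, patch_id))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    hits = rank_instances(images=[None] * 3, query="red umbrella", top_k=5)
    for sim, img_id, patch_id in hits:
        print(f"image {img_id}, patch {patch_id}: similarity {sim:.3f}")
```

Because instance and query embeddings live in the same space, this single dot-product scoring pass supports arbitrary query vocabulary without any per-category classifier, which is the point of the open-vocabulary setting.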