Image and text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art approaches, e.g., CLIP and ALIGN, represent images and texts as dense embeddings and use the similarity in the dense embedding space as the matching score. On the other hand, sparse semantic features such as bag-of-words representations are more interpretable but are believed to suffer from inferior accuracy compared to dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), in which images and texts are mapped to a sparse token space. Each token in this space is a (sub-)word in the vocabulary, which is not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +$4.9\%$ and +$4.3\%$ absolute Recall@1 improvements on COCO-5k text$\rightarrow$image and image$\rightarrow$text retrieval, respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.
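To make the dense-vs-sparse contrast concrete, the following is a minimal sketch (not the authors' implementation) of the two scoring schemes the abstract describes: CLIP-style cosine similarity in a dense embedding space versus a dot product over vocabulary-sized sparse activations, where each nonzero dimension corresponds to a (sub-)word token. The projection matrices, the ReLU/log1p sparsification, and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, vocab_size = 512, 32_000  # assumed sizes for illustration

# Dense (CLIP-style): images and texts share a d-dimensional space;
# the matching score is cosine similarity.
img_dense = rng.normal(size=(4, embed_dim))
txt_dense = rng.normal(size=(4, embed_dim))
img_dense /= np.linalg.norm(img_dense, axis=-1, keepdims=True)
txt_dense /= np.linalg.norm(txt_dense, axis=-1, keepdims=True)
dense_scores = img_dense @ txt_dense.T          # (4, 4) image-text similarity

# Sparse (STAIR-style, sketched): each example is projected onto the
# tokenizer vocabulary; ReLU zeroes out most dimensions, so every surviving
# dimension is a human-readable (sub-)word that an inverted index can serve.
proj_img = rng.normal(size=(embed_dim, vocab_size))  # hypothetical projection
proj_txt = rng.normal(size=(embed_dim, vocab_size))  # hypothetical projection
img_sparse = np.log1p(np.maximum(img_dense @ proj_img, 0.0))
txt_sparse = np.log1p(np.maximum(txt_dense @ proj_txt, 0.0))
sparse_scores = img_sparse @ txt_sparse.T       # dot product in token space

print(dense_scores.shape, sparse_scores.shape)  # both (4, 4)
```

Because the sparse scores are plain dot products over vocabulary indices, they can be computed with standard inverted-index machinery rather than dense nearest-neighbor search, which is the integration advantage the abstract points to.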