Relation-focused cross-modal information retrieval retrieves information based on the relations expressed in user queries, and it is particularly important in information retrieval applications and next-generation search engines. To date, CLIP (Contrastive Language-Image Pre-training) has achieved state-of-the-art performance in cross-modal learning tasks due to its efficient learning of visual concepts from natural language supervision. However, CLIP learns visual representations from natural language at a global level, without the capability of focusing on image-object relations. This paper proposes a novel CLIP-based network for Relation Reasoning, CLIP-RR, that tackles relation-focused cross-modal information retrieval. The proposed network utilises CLIP to leverage its pre-trained knowledge, and it additionally comprises two main parts: (1) a part that extends the capabilities of CLIP to extract and reason with object relations in images; and (2) a part that aggregates the reasoned results to predict the similarity scores between images and descriptions. Experiments were carried out by applying the proposed network to relation-focused cross-modal information retrieval tasks on the RefCOCOg, CLEVR, and Flickr30K datasets. The results revealed that the proposed network outperformed various other state-of-the-art networks, including CLIP, VSE$\infty$, and VSRN++, on both image-to-text and text-to-image cross-modal information retrieval tasks.
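To make the retrieval setting concrete, the following is a minimal sketch of global-level similarity scoring and ranking in both retrieval directions. It is not the CLIP-RR network and does not call CLIP: the embedding vectors are made-up toy values standing in for encoder outputs, and the function names (`cosine`, `similarity_matrix`, `image_to_text`, `text_to_image`) are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(image_embs, text_embs):
    """Rows index images, columns index descriptions."""
    return [[cosine(i, t) for t in text_embs] for i in image_embs]


def rank(scores):
    """Indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


def image_to_text(sim, image_idx):
    """Rank all descriptions for one query image."""
    return rank(sim[image_idx])


def text_to_image(sim, text_idx):
    """Rank all images for one query description."""
    return rank([row[text_idx] for row in sim])


# Toy embeddings (assumed values, not produced by any real encoder).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

sim = similarity_matrix(image_embs, text_embs)
print(image_to_text(sim, 0))  # descriptions ranked for image 0
print(text_to_image(sim, 1))  # images ranked for description 1
```

Each image and each description is reduced here to a single vector, so the score reflects only a global match. This is the limitation the abstract attributes to CLIP, and it is what the relation-reasoning and aggregation parts of the proposed network are described as addressing.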