TIReID aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods exploit prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondences. Moreover, because of the substantial gap between modalities, they embed the original modal features into a shared latent space for cross-modal alignment; this feature embedding, however, may distort intra-modal information. Recently, CLIP has attracted extensive attention for its powerful semantic concept learning capacity and rich multi-modal knowledge, both of which can help address these problems. Accordingly, in this paper we propose a CLIP-driven Fine-grained information excavation framework (CFine) that fully exploits the knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we perform fine-grained information excavation that mines intra-modal discriminative clues and inter-modal correspondences. Specifically, we first design a multi-grained global feature learning module to fully mine intra-modal discriminative local information, which emphasizes identity-related clues by strengthening the interactions between the global image (text) representation and informative local patches (words). Second, we propose cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules that establish cross-grained and fine-grained interactions between modalities, filtering out non-modality-shared image patches/words and mining cross-modal correspondences from coarse to fine. CFR and FCD are removed during inference to save computational cost. Note that the entire process is performed in the original modality space without further feature embedding. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method on TIReID.
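To make the multi-grained global feature learning idea concrete, the sketch below shows one plausible reading of it: each local patch (or word) token from a CLIP encoder is scored by its similarity to the global token, the top-k most informative tokens are kept, and a light attention layer fuses them back into an enhanced global feature. This is a minimal illustration, not the authors' implementation; the module name `MultiGrainedGlobalFeature`, the top-k selection rule, and the attention-based fusion are all assumptions made for exposition.

```python
# Hypothetical sketch of the multi-grained global feature learning step
# described in the abstract. Not the official CFine code: the selection
# rule (top-k similarity to the global token) and the fusion layer
# (single cross-attention with a residual) are illustrative assumptions.
import torch
import torch.nn as nn


class MultiGrainedGlobalFeature(nn.Module):
    """Enhances a global token with its most informative local tokens."""

    def __init__(self, dim: int = 512, top_k: int = 8, num_heads: int = 8):
        super().__init__()
        self.top_k = top_k
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor):
        # global_feat: (B, D) CLS/EOS token from a (frozen) CLIP encoder
        # local_feats: (B, N, D) patch or word tokens from the same encoder
        scores = torch.einsum("bd,bnd->bn", global_feat, local_feats)  # (B, N)
        idx = scores.topk(self.top_k, dim=1).indices                   # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, local_feats.size(-1))
        informative = local_feats.gather(1, idx)                       # (B, k, D)
        # The global token attends to its informative locals; the residual
        # keeps the original CLIP feature dominant (no re-embedding).
        q = global_feat.unsqueeze(1)                                   # (B, 1, D)
        fused, _ = self.fuse(q, informative, informative)
        return self.norm(global_feat + fused.squeeze(1))               # (B, D)


if __name__ == "__main__":
    mgf = MultiGrainedGlobalFeature(dim=512, top_k=8)
    g = torch.randn(4, 512)        # e.g., CLIP global image features
    l = torch.randn(4, 196, 512)   # e.g., 14x14 ViT patch tokens
    print(mgf(g, l).shape)         # torch.Size([4, 512])
```

Because the update is a residual on top of the CLIP token, the enhanced feature stays in the original modality space, which is consistent with the abstract's claim that no further feature embedding is performed; the cross-modal CFR and FCD interactions would operate on such tokens during training only.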