A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

Cross-Modal Retrieval (CMR) is an important research topic spanning multimodal computing and information retrieval: it takes one type of data as the query to retrieve relevant data of another type, and it is widely used in real-world applications. Recently, vision-language pre-trained models, represented by CLIP, have demonstrated their superiority in learning visual and textual representations and have achieved impressive performance on a variety of vision- and language-related tasks. Although CLIP and earlier pre-trained models have brought large performance improvements in unsupervised CMR, their performance and impact on supervised CMR have rarely been explored, owing to the lack of a common representation for multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model and conduct a comprehensive empirical study. We evaluate its performance and impact on supervised CMR and attempt to answer several key research questions. To this end, we first propose a novel model, CLIP4CMR (CLIP enhanced network for Cross-Modal Retrieval), which employs the pre-trained CLIP as the backbone network to perform supervised CMR. Then, by means of the CLIP4CMR framework, we revisit the design of the different learning objectives used in current CMR methods to provide new insights into model design. Moreover, we investigate the aspects of greatest concern when applying CMR in practice, including robustness to modality imbalance and sensitivity to hyper-parameters, to provide new perspectives for practical applications. Through extensive experiments, we show that CLIP4CMR achieves state-of-the-art (SOTA) results with prominent improvements on the benchmark datasets, and that it can serve as a fundamental framework for empirically studying the key research issues of supervised CMR, with significant implications for model design and practical considerations.
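As a rough illustration of the retrieval setting described above, the following minimal sketch ranks items of one modality against a query from the other by cosine similarity and scores the ranking with mean average precision, treating items that share a class label as relevant. It is not the CLIP4CMR implementation: the toy vectors merely stand in for image and text embeddings in a common space, the choice of cosine similarity and mean average precision is an assumption (the abstract does not name a similarity measure or metric), and all function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_rank(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    q = l2_normalize(query)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: -sims[i])

def average_precision(ranked_labels, query_label):
    """Average precision of one ranking; an item is relevant if its label matches."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(queries, query_labels, gallery, gallery_labels):
    """Mean of the per-query average precision over all queries."""
    aps = []
    for q, q_label in zip(queries, query_labels):
        order = cosine_rank(q, gallery)
        aps.append(average_precision([gallery_labels[i] for i in order], q_label))
    return sum(aps) / len(aps)

# Toy stand-ins for embeddings in a shared space (assumed, not real CLIP features).
image_embs = [[1.0, 0.1], [0.1, 1.0]]
image_labels = [0, 1]
text_embs = [[0.9, 0.2], [0.2, 0.9], [1.0, 0.0]]
text_labels = [0, 1, 0]

# Image-to-text retrieval: each image queries the text gallery.
score = mean_average_precision(image_embs, image_labels, text_embs, text_labels)
print(f"image-to-text mAP on toy data: {score:.3f}")
```

Swapping the query and gallery arguments gives the text-to-image direction; in a real experiment the toy vectors would be replaced by features produced by the trained model.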
