Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video$+$question, video$+$speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models such as CLIP have been shown to be highly effective at aligning entities in images/videos and text, and contrastive text models such as SimCSE have been studied extensively for their strong ability to produce discriminative sentence embeddings. These abilities are exactly what multi-channel video-language retrieval needs. However, there is no clear way to quickly adapt these two lines of models to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate representing videos as continuous feature vectors or as discrete text tokens; for fusion, we explore a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. Surprisingly, we find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, even outperforming the state of the art on the iVQA and How2QA datasets without additional training on millions of video-language samples. Further analysis shows that this is because representing videos as text tokens captures the key visual information in a form that is naturally aligned with text models, and because the text models are strong multimodal retrievers after contrastive pretraining.
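
The best-performing combination described above (videos as discrete text tokens, fused by a text model) can be illustrated with a minimal sketch. This is not the authors' implementation. The vocabulary embeddings, the top-k token selection rule, and the helper names (`video_to_tokens`, `toy_text_encoder`, `retrieve`) are all hypothetical, and a bag-of-words encoder stands in for a pretrained contrastive text model such as SimCSE purely so that the example runs without any external dependency.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def video_to_tokens(frame_vecs, vocab_vecs, k=2):
    """Represent a video as discrete text tokens (assumed selection rule).

    Each vocabulary word is scored by its best cosine similarity to any
    frame feature; the k highest-scoring words are kept.
    """
    scores = {
        word: max(cosine(frame, vec) for frame in frame_vecs)
        for word, vec in vocab_vecs.items()
    }
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def toy_text_encoder(text):
    """Stand-in for a contrastive text model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def sparse_cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(frame_vecs, vocab_vecs, question, candidates, k=2):
    """Fuse video tokens and question as plain text, then rank candidates."""
    tokens = video_to_tokens(frame_vecs, vocab_vecs, k)
    query = toy_text_encoder(" ".join(tokens) + " " + question)
    ranked = sorted(
        candidates,
        key=lambda c: -sparse_cosine(query, toy_text_encoder(c)),
    )
    return tokens, ranked


# Hypothetical toy data: 3-dimensional "frame features" and word embeddings
# that are assumed to live in the same space (as with an image-text model).
vocab_vecs = {"dog": [1, 0, 0], "ball": [0, 1, 0], "car": [0, 0, 1]}
frame_vecs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

tokens, ranked = retrieve(
    frame_vecs,
    vocab_vecs,
    question="what is the animal chasing",
    candidates=["a ball", "a car", "a frisbee"],
)
print(tokens, ranked[0])
```

Because the video is reduced to words before fusion, the retrieval step needs only a text encoder. In a realistic setting the toy encoder would be replaced by a pretrained contrastive sentence encoder, and the frame and word vectors would come from a pretrained image-text model.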
