With the explosive growth of web videos and the emergence of large-scale vision-language pre-training models such as CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to map text-video pairs into a shared embedding space and craft cross-modal interactions between entities at specific granularities for semantic correspondence. Unfortunately, the intrinsic uncertainty over which entity combinations, at which granularities, best serve a given cross-modal query is understudied, which is especially critical for modalities with hierarchical semantics, e.g., video and text. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure. Concretely, we add additional learnable tokens to the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions from which prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks justify the superiority of our UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available in the supplementary materials and will be released publicly soon.
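The following is a minimal PyTorch sketch of the two ideas summarized above, using assumed module names, shapes, and hyperparameters rather than the authors' released implementation: (1) appending extra learnable tokens to an encoder's token sequence so they can aggregate multi-grained semantics, and (2) representing each text or video embedding as a diagonal Gaussian whose sampled prototypes are compared for distribution matching.

```python
# Hypothetical sketch of UATVR-style components (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveTokenEncoder(nn.Module):
    """Wraps any sequence encoder and appends K learnable tokens before encoding."""

    def __init__(self, encoder: nn.Module, dim: int, num_extra_tokens: int = 4):
        super().__init__()
        self.encoder = encoder  # any module mapping (B, L, D) -> (B, L, D)
        self.extra_tokens = nn.Parameter(torch.randn(num_extra_tokens, dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        b = token_embeds.size(0)
        extra = self.extra_tokens.unsqueeze(0).expand(b, -1, -1)   # (B, K, D)
        x = torch.cat([token_embeds, extra], dim=1)                # (B, L+K, D)
        x = self.encoder(x)
        return x.mean(dim=1)                                       # pooled embedding


class ProbabilisticHead(nn.Module):
    """Maps a pooled embedding to a diagonal Gaussian and draws prototype samples."""

    def __init__(self, dim: int, num_samples: int = 7):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)
        self.num_samples = num_samples

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.mu(feat), self.log_var(feat)
        std = (0.5 * log_var).exp()
        eps = torch.randn(self.num_samples, *feat.shape, device=feat.device)
        samples = mu.unsqueeze(0) + eps * std.unsqueeze(0)         # (S, B, D)
        return F.normalize(samples, dim=-1)


def distribution_matching_score(text_samples: torch.Tensor,
                                video_samples: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity over sampled prototypes: (S, Bt, D) x (S, Bv, D) -> (Bt, Bv)."""
    return torch.einsum("sid,sjd->ij", text_samples, video_samples) / text_samples.size(0)
```

In this sketch the extra tokens are simply mean-pooled together with the original sequence; the actual aggregation and the exact sampling/matching objective are design choices of the paper that the abstract does not fully specify.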