Retrieving target videos based on text descriptions is a task of great practical value and has received increasing attention over the past few years. In this paper, we focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive. We first show that the multi-query retrieval task is more pragmatic, better reflects real-world use cases, and more effectively evaluates the retrieval capabilities of current models, and thus deserves further investigation alongside the more prevalent single-query retrieval setup. We then propose several new methods for leveraging multiple queries at training time, improving over the simple baseline of combining the similarity outputs of multiple queries from models trained in the standard single-query setting. Our models consistently outperform several competitive baselines on three different datasets. For instance, Recall@1 can be improved by 4.7 points on MSR-VTT, 4.1 points on MSVD, and 11.7 points on VATEX over a strong baseline built on the state-of-the-art CLIP4Clip model. We believe further modeling efforts will bring new insights to this direction and spark new systems that perform better in real-world video retrieval applications. Code is available at https://github.com/princetonvisualai/MQVR.
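The simple baseline mentioned above, combining per-query similarity outputs from a single-query model, can be sketched as follows. This is a minimal illustration, not the paper's actual method: the averaging scheme and toy embeddings are assumptions for demonstration, standing in for embeddings a model such as CLIP4Clip would produce.

```python
import numpy as np

def multi_query_scores(query_embs: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Combine multiple queries by averaging per-query cosine similarities.

    query_embs: (n_queries, d) text embeddings for the same target video.
    video_embs: (n_videos, d) embeddings of the video archive.
    Returns a (n_videos,) score vector used to rank videos.
    """
    # L2-normalize so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = q @ v.T               # (n_queries, n_videos) per-query similarities
    return sims.mean(axis=0)     # combine by averaging over the queries

# Toy example: two queries describing the same video, archive of three videos
queries = np.array([[1.0, 0.0], [0.8, 0.2]])
videos = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.3]])
scores = multi_query_scores(queries, videos)
ranked = np.argsort(-scores)     # best-matching video first
```

The methods proposed in the paper aim to do better than this post-hoc score combination by exploiting multiple queries already at training time.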