MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in their inputs (video only, or a video-query pair in which the query is an image region or a sentence) and in their outputs (temporal segments or spatio-temporal tubes). At their core, however, they require the same fundamental understanding of the video, i.e., the actors and objects in it, their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures, which do not exploit the interplay between tasks. In contrast, in this paper, we present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark, which entail queries of three different forms: given an egocentric video and a visual, textual or activity query, the goal is to determine when and where the answer can be seen within the video. Our model design is inspired by recent query-based approaches to spatio-temporal grounding, and contains modality-specific query encoders and task-specific sliding window inference that allow multi-task training with diverse input modalities and different structured outputs. We exhaustively analyze relationships among the tasks and illustrate that cross-task learning leads to improved performance on each individual task, as well as the ability to generalize to unseen tasks, such as zero-shot spatial localization of language queries.
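To make the idea of sliding window inference over a long video concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not state window sizes, strides, or how per-window predictions are combined, so the highest-score selection rule and the names `sliding_windows`, `ground_query`, and `predict_in_window` are all assumptions made for illustration. The per-window predictor stands in for a model that returns a segment in window-local frame coordinates together with a confidence score.

```python
from typing import Callable, List, Tuple

# (start_frame, end_frame, score); end_frame is exclusive.
Segment = Tuple[int, int, float]


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges that together reach the end of the video."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def ground_query(
    num_frames: int,
    window: int,
    stride: int,
    predict_in_window: Callable[[int, int], Segment],
) -> Segment:
    """Run a (hypothetical) per-window predictor and keep the best-scoring segment.

    predict_in_window(start, end) must return a segment whose frame indices are
    relative to `start`. Taking the single highest score is an assumption of
    this sketch, not a rule stated in the abstract.
    """
    best = None
    for start, end in sliding_windows(num_frames, window, stride):
        local_start, local_end, score = predict_in_window(start, end)
        # Shift window-local indices back to global video coordinates.
        candidate = (start + local_start, start + local_end, score)
        if best is None or candidate[2] > best[2]:
            best = candidate
    if best is None:
        raise ValueError("video has no frames")
    return best
```

In a multi-task setting, the task-specific part would be the predictor and the selection rule: a temporal task keeps only the frame range, while a spatio-temporal task would also carry per-frame boxes through the same coordinate shift.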
