VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling. Specifically, the original video frame patches are "tokenized" into discrete visual tokens, and the goal is to recover the original visual tokens based on the masked patches. Comprehensive analysis demonstrates the effectiveness of both explicit temporal modeling via video transformer and MVM. As a result, VIOLET achieves new state-of-the-art performance on 5 video question answering tasks and 4 text-to-video retrieval tasks.
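The MVM objective described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. The nearest-neighbour codebook lookup is an assumed stand-in for whatever learned tokenizer produces the discrete visual tokens. The function names (`tokenize`, `mask_patches`, `mvm_loss`), the zero-vector mask, and the toy data are all hypothetical. The logits would normally come from the video-language transformer; here they are supplied by hand.

```python
import math
import random


def tokenize(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry.

    Assumption: a nearest-neighbour lookup stands in for a learned tokenizer.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))
        for p in patches
    ]


def mask_patches(patches, mask_ratio, rng):
    """Replace a random subset of patches with zero vectors (a toy mask)."""
    n = len(patches)
    num_masked = max(1, round(mask_ratio * n))
    masked_idx = sorted(rng.sample(range(n), num_masked))
    dim = len(patches[0])
    masked = [
        [0.0] * dim if i in masked_idx else list(p)
        for i, p in enumerate(patches)
    ]
    return masked, masked_idx


def mvm_loss(logits, targets, masked_idx):
    """Mean cross-entropy over masked positions only.

    logits[i] holds one score per codebook entry for patch i.
    """
    total = 0.0
    for i in masked_idx:
        row = logits[i]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[targets[i]]
    return total / len(masked_idx)


# Toy usage: 4 patches of dimension 2 and a codebook with 3 entries.
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.8], [1.0, 0.9]]

targets = tokenize(patches, codebook)            # discrete visual tokens
masked, masked_idx = mask_patches(patches, 0.5, rng)

# Hypothetical model output: uniform scores, i.e. an untrained predictor.
uniform_logits = [[0.0] * len(codebook) for _ in patches]
loss = mvm_loss(uniform_logits, targets, masked_idx)
print(targets, masked_idx, loss)
```

Two points are worth noting in this sketch: the targets are computed from the original, unmasked patches, and the loss is taken only at masked positions. A uniform predictor therefore scores the log of the codebook size, which is a useful baseline when sanity-checking such a loss.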
