Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies failed to find a truly effective MVM strategy that can largely benefit downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
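To make the MVM objective concrete, the following is a minimal PyTorch sketch of masked patch reconstruction against a generic per-patch target (e.g., pixel values, HOG, depth, flow, or latent features); it is not the authors' implementation, and all names and hyperparameters (MaskedVisualModeling, mask_ratio, the encoder interface) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedVisualModeling(nn.Module):
    """Sketch of an MVM head: randomly mask video patch embeddings, encode the
    corrupted sequence, and regress the masked positions to a chosen target.
    Shapes and names are illustrative, not the actual VIOLET(v2) code."""

    def __init__(self, hidden_dim=768, target_dim=768, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(hidden_dim))  # learned [MASK] embedding
        self.head = nn.Linear(hidden_dim, target_dim)            # per-patch target predictor

    def forward(self, patch_emb, target, encoder):
        # patch_emb: (B, N, D) video patch embeddings
        # target:    (B, N, T) per-patch reconstruction targets
        B, N, _ = patch_emb.shape
        mask = torch.rand(B, N, device=patch_emb.device) < self.mask_ratio   # (B, N) bool
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, patch_emb)
        hidden = encoder(corrupted)                  # contextualized features, (B, N, D)
        pred = self.head(hidden)                     # predicted targets, (B, N, T)
        # loss only on masked positions; gradients flow end-to-end into the video encoder
        loss = ((pred - target) ** 2)[mask].mean()
        return loss
```

In this sketch, swapping the `target` tensor (and `target_dim`) is all that changes between the different reconstructive targets studied in the paper; the masking and loss machinery stays the same.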