Video-language pre-training is crucial for learning powerful multi-modal representations. However, it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundational component of SMAUG is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy considers both the visual and textual modalities, providing better cross-modal alignment and further reducing pre-training cost. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only the "important" spatial regions and temporal frames for pre-training. Coupling all of these designs allows our method to enjoy both competitive performance on text-to-video retrieval and video question answering tasks, and a pre-training cost reduction of 1.9X or more. For example, SMAUG needs only about 50 NVIDIA A6000 GPU hours of pre-training to attain competitive performance on these two video-language tasks across six popular benchmarks.
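To make the two ideas in the abstract concrete, below is a minimal sketch of (1) MAE-style masking applied to both visual and textual tokens and (2) space-time token sparsification that keeps only high-scoring frames and patches. It assumes PyTorch; the function names (`random_mask`, `sparsify_space_time`), the token shapes, and the L2-norm importance heuristic are illustrative assumptions, not SMAUG's actual implementation.

```python
# Sketch of dual-modality masking + space-time token sparsification.
# The importance scoring here (token L2 norm) is an assumed stand-in for
# SMAUG's context-based selection, not the paper's exact method.
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float):
    """Randomly drop a fraction of tokens (MAE-style masking).

    tokens: (batch, num_tokens, dim); returns kept tokens and kept indices.
    """
    b, n, d = tokens.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    noise = torch.rand(b, n, device=tokens.device)      # random per-token scores
    keep_idx = noise.argsort(dim=1)[:, :num_keep]       # keep the lowest-noise tokens
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep_idx

def sparsify_space_time(video_tokens: torch.Tensor, keep_frames: int, keep_patches: int):
    """Keep top-scoring temporal frames, then top-scoring spatial patches per kept frame.

    video_tokens: (batch, frames, patches, dim).
    """
    b, t, p, d = video_tokens.shape
    frame_scores = video_tokens.norm(dim=-1).mean(dim=-1)        # (b, t)
    top_frames = frame_scores.topk(keep_frames, dim=1).indices
    frames = torch.gather(
        video_tokens, 1, top_frames[..., None, None].expand(-1, -1, p, d))
    patch_scores = frames.norm(dim=-1)                           # (b, keep_frames, p)
    top_patches = patch_scores.topk(keep_patches, dim=-1).indices
    patches = torch.gather(
        frames, 2, top_patches[..., None].expand(-1, -1, -1, d))
    return patches                                               # (b, keep_frames, keep_patches, d)

# Toy usage: 8 frames of 196 patch tokens and 32 text tokens per sample.
video = torch.randn(2, 8, 196, 768)
text = torch.randn(2, 32, 768)
sparse_video = sparsify_space_time(video, keep_frames=4, keep_patches=98)
vis_kept, _ = random_mask(sparse_video.flatten(1, 2), mask_ratio=0.75)
txt_kept, _ = random_mask(text, mask_ratio=0.15)
print(vis_kept.shape, txt_kept.shape)  # torch.Size([2, 98, 768]) torch.Size([2, 27, 768])
```

Under these assumed ratios, the encoder would process 98 of the original 1,568 visual tokens per clip, which is the kind of token reduction that drives the pre-training savings the abstract describes.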