Inspired by the success of transformer-based pre-training methods on natural language tasks, and subsequently on computer vision tasks, researchers have begun to apply transformers to video processing. This survey aims to give a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as background knowledge, including the attention mechanism, position encoding, etc. We then describe the typical pre-training and fine-tuning paradigm for Video-Language processing in terms of proxy tasks, downstream tasks, and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations, and compare their performance. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.