The canonical approach to video captioning requires a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames, in contrast to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering). Moreover, to avoid the inherent redundancy in consecutive video frames, we propose adaptively learning a sparse attention mask and optimizing it for task-specific performance improvement through better long-range video sequence modeling. Through extensive experiments on five video captioning datasets, we show that SwinBERT achieves across-the-board performance improvements over previous methods, often by a large margin. In addition, the learned sparse attention masks push performance to a new state of the art, and can be transferred between different video lengths and between different datasets.
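To make the learnable sparse attention mask idea more concrete, the sketch below shows one plausible reading of it: a soft mask over video-token pairs is learned jointly with the attention layer and pushed toward sparsity with an L1-style penalty added to the captioning loss. This is an illustrative sketch under assumed shapes and module names, not the paper's actual implementation; the class `LearnableSparseAttention` and all hyperparameters are hypothetical.

```python
# Illustrative sketch (assumptions, not the paper's code): a learnable soft
# attention mask over video tokens, regularized toward sparsity.
import torch
import torch.nn as nn


class LearnableSparseAttention(nn.Module):
    def __init__(self, num_tokens: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One learnable logit per token pair; sigmoid maps it to a soft mask in (0, 1).
        self.mask_logits = nn.Parameter(torch.zeros(num_tokens, num_tokens))

    def forward(self, video_tokens: torch.Tensor):
        # video_tokens: (batch, num_tokens, dim)
        soft_mask = torch.sigmoid(self.mask_logits)  # (T, T), values in (0, 1)
        # Turn the soft mask into an additive bias on attention logits:
        # entries near 0 suppress that token pair, entries near 1 leave it intact.
        attn_bias = torch.log(soft_mask + 1e-6)
        out, _ = self.attn(video_tokens, video_tokens, video_tokens,
                           attn_mask=attn_bias)
        # Sparsity regularizer: added (with a weight) to the captioning loss
        # so that most mask entries are driven toward zero during training.
        sparsity_loss = soft_mask.abs().mean()
        return out, sparsity_loss


# Minimal usage example with random "video tokens".
if __name__ == "__main__":
    tokens = torch.randn(2, 64, 512)  # (batch, tokens, dim)
    layer = LearnableSparseAttention(num_tokens=64, dim=512)
    out, sparsity_loss = layer(tokens)
    print(out.shape, sparsity_loss.item())
```

Because the learned mask is tied to token positions rather than to a particular feature extractor, a mask of this form could in principle be reused across videos of similar token length, which is consistent with the transferability claim in the abstract.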