SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

The canonical approach to video captioning requires a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as input and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable-length video input without dedicated design for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames, in contrast to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering). Moreover, to avoid the inherent redundancy in consecutive video frames, we propose adaptively learning a sparse attention mask and optimizing it for task-specific performance improvement through better long-range video sequence modeling. Through extensive experiments on 5 video captioning datasets, we show that SwinBERT achieves across-the-board performance improvements over previous methods, often by a large margin. The learned sparse attention masks further advance the state of the art, and can be transferred between different video lengths and between different datasets. Code is available at https://github.com/microsoft/SwinBERT
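
To make the idea of a learned sparse attention mask concrete, the following is a minimal sketch in plain Python, not the implementation from the SwinBERT repository. It assumes the mask is stored as one learnable logit per token pair, squashed to (0, 1) with a sigmoid, applied by adding the log of the gate to the attention scores, and regularized with an L1-style penalty. The function names (`masked_attention_weights`, `sparsity_penalty`) and the weight `lam` are hypothetical, introduced only for illustration.

```python
import math


def sigmoid(x):
    """Map a real-valued mask logit to a soft gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores, mask_logits):
    """Gate raw attention scores with a soft mask, then normalize each row.

    Assumption for this sketch: adding log(gate) to a score scales its
    unnormalized softmax weight by the gate, so gates near 0 suppress a pair.
    """
    weights = []
    for score_row, logit_row in zip(scores, mask_logits):
        gates = [sigmoid(l) for l in logit_row]
        gated = [s + math.log(g + 1e-9) for s, g in zip(score_row, gates)]
        weights.append(softmax(gated))
    return weights


def sparsity_penalty(mask_logits, lam=1.0):
    """L1-style penalty on the activated mask (hypothetical weight `lam`).

    Minimizing this term alongside the captioning loss pushes gates toward 0.
    """
    return lam * sum(sigmoid(l) for row in mask_logits for l in row)


# Two tokens with equal raw scores; the first row's mask strongly
# favors token 0, the second row's mask is neutral.
scores = [[0.0, 0.0], [0.0, 0.0]]
mask_logits = [[10.0, -10.0], [0.0, 0.0]]
weights = masked_attention_weights(scores, mask_logits)
penalty = sparsity_penalty(mask_logits)
```

In a real model the mask logits would be trainable parameters updated by gradient descent together with the rest of the network; the sketch only shows how a soft mask reshapes the attention distribution and how the sparsity term is computed.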
