The canonical approach to video captioning requires a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatio-temporal representations that can adapt to variable lengths of video input without dedicated designs for different frame rates. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames, in contrast to previous successes with sparsely sampled frames on video-and-language understanding tasks (e.g., video question answering). Moreover, to avoid the inherent redundancy in consecutive video frames, we propose adaptively learning a sparse attention mask and optimizing it for task-specific performance improvement through better long-range video sequence modeling. Through extensive experiments on 5 video captioning datasets, we show that SwinBERT achieves across-the-board performance improvements over previous methods, often by a large margin. Furthermore, the learned sparse attention masks push the performance to new state of the art, and can be transferred between different video lengths and between different datasets.
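The learnable sparse attention mask described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact architecture: it assumes a soft mask parameterized by learnable logits, applied as an additive bias to the attention scores, with an L1-style penalty that pushes mask entries toward zero so attention concentrates on non-redundant video tokens. All names and hyperparameters here are hypothetical.

```python
import torch
import torch.nn as nn


class SparseMaskedAttention(nn.Module):
    """Self-attention over video tokens with a learnable sparse mask (sketch)."""

    def __init__(self, num_tokens: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable logits for the token-to-token attention mask.
        self.mask_logits = nn.Parameter(torch.zeros(num_tokens, num_tokens))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Soft mask in (0, 1); log turns it into an additive attention bias
        # that approaches -inf where the mask is close to 0.
        soft_mask = torch.sigmoid(self.mask_logits)
        bias = torch.log(soft_mask + 1e-6)
        out, _ = self.attn(x, x, x, attn_mask=bias)
        return out

    def sparsity_loss(self) -> torch.Tensor:
        # Regularizer added to the captioning objective to encourage sparsity.
        return torch.sigmoid(self.mask_logits).mean()
```

In training, the sparsity term would be added to the caption generation loss with a small weight, so the mask is optimized jointly with the task rather than fixed by hand.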


