VILT:没有革命或地区监督的愿景和语言变异器 (ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision) - 专知论文

会员服务 ·

0

变换 · 卷积 · Processing（编程语言） · INTERACT · Performer ·

2021 年 6 月 10 日

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

翻译：VILT:没有革命或地区监督的愿景和语言变异器

Wonjae Kim,Bokyung Son,Ildoo Kim

from arxiv, ICML 2021 Long Presentation

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance. Our code and pre-trained weights are available at https://github.com/dandelin/vilt.

翻译：视觉和语言预科培训(VLP)改进了各种共同视觉和语言下游任务的业绩。目前对VLP的处理办法在很大程度上依赖于图像特征提取过程,其中多数涉及区域监督(例如物体探测)和革命结构(例如ResNet ) 。尽管在文献中被忽视,但我们发现在以下两方面都存在问题:(1) 效率/速度,即仅仅提取输入特征所需要的计算量远远多于多式互动步骤;(2) 表达力,因为它与视觉嵌入器及其预先定义的视觉词汇的表达力高度相连。在本文件中,我们提出了一个最小的VLP模型、视觉和语言变异器(VILT),其单一性意味着视觉投入的处理被大大简化到与我们处理文字输入的相同的无革命性方式。我们显示VLT比以前的VLP模型要快数十倍,但具有竞争性或更好的下游任务性。我们的代码和预先训练前重量可以在 https://github.com/dandelinvil.

2

相关内容

【ICCV 2021 】Vision Transformer中的相对位置编码

专知会员服务

30+阅读 · 2021年7月30日

【CVPR2021】预训练图像处理Transformer

专知会员服务

46+阅读 · 2021年6月1日

【AAAI2021】面向真实世界的鲁棒视觉信息提取:新的数据集和新颖的解决方案

【AAAI2021】面向真实世界的鲁棒视觉信息提取:新的数据集和新颖的解决方案

专知会员服务

22+阅读 · 2021年2月17日

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

324+阅读 · 2020年11月26日

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

专知会员服务

24+阅读 · 2020年3月31日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【Google】无监督机器翻译，Unsupervised Machine Translation

【Google】无监督机器翻译，Unsupervised Machine Translation

专知会员服务

36+阅读 · 2020年3月3日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

灾难性遗忘问题新视角：迁移-干扰平衡

灾难性遗忘问题新视角：迁移-干扰平衡

CreateAMind

17+阅读 · 2019年7月6日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

PaperWeekly

7+阅读 · 2019年5月5日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【推荐】YOLO实时目标检测(6fps)

【推荐】YOLO实时目标检测(6fps)

机器学习研究会

20+阅读 · 2017年11月5日

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Arxiv

0+阅读 · 2021年8月4日

Vision Transformer with Progressive Sampling

Arxiv

0+阅读 · 2021年8月3日

SiT: Self-supervised vIsion Transformer

Arxiv

19+阅读 · 2021年4月8日

End-to-End Object Detection with Fully Convolutional Network

Arxiv

9+阅读 · 2021年3月26日

End-to-End Video Instance Segmentation with Transformers

Arxiv

10+阅读 · 2021年3月24日

Dynamic Region-Aware Convolution

Arxiv

7+阅读 · 2021年3月15日

Temporal Relational Modeling with Self-Supervision for Action Segmentation

Arxiv

13+阅读 · 2020年12月14日

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Arxiv

8+阅读 · 2020年12月7日

Speech2Action: Cross-modal Supervision for Action Recognition

Speech2Action: Cross-modal Supervision for Action Recognition

Arxiv

7+阅读 · 2020年3月30日

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation

Arxiv

3+阅读 · 2019年3月6日

VIP会员

文章信息

相关主题

Processing（编程语言）

相关VIP内容

【ICCV 2021 】Vision Transformer中的相对位置编码

专知会员服务

30+阅读 · 2021年7月30日

【CVPR2021】预训练图像处理Transformer

专知会员服务

46+阅读 · 2021年6月1日

【AAAI2021】面向真实世界的鲁棒视觉信息提取:新的数据集和新颖的解决方案

【AAAI2021】面向真实世界的鲁棒视觉信息提取:新的数据集和新颖的解决方案

专知会员服务

22+阅读 · 2021年2月17日

最新《Transformers模型》教程，64页ppt

最新《Transformers模型》教程，64页ppt

专知会员服务

324+阅读 · 2020年11月26日

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

专知会员服务

24+阅读 · 2020年3月31日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【Google】无监督机器翻译，Unsupervised Machine Translation

【Google】无监督机器翻译，Unsupervised Machine Translation

专知会员服务

36+阅读 · 2020年3月3日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

【NeurIPS2025】VideoLucy：用于长视频理解的深度记忆回溯机制

不确定环境下无人机与无人地面车辆编队的地下勘探规划算法 | 122页

【NTU博士论文】端到端鲁棒自动语音识别的最新进展

用于强化学习的扩散模型：基础、分类与发展

相关资讯

灾难性遗忘问题新视角：迁移-干扰平衡

灾难性遗忘问题新视角：迁移-干扰平衡

CreateAMind

17+阅读 · 2019年7月6日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

PaperWeekly

7+阅读 · 2019年5月5日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【推荐】YOLO实时目标检测(6fps)

【推荐】YOLO实时目标检测(6fps)

机器学习研究会

20+阅读 · 2017年11月5日

相关论文

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Arxiv

0+阅读 · 2021年8月4日

Vision Transformer with Progressive Sampling

Arxiv

0+阅读 · 2021年8月3日

SiT: Self-supervised vIsion Transformer

Arxiv

19+阅读 · 2021年4月8日

End-to-End Object Detection with Fully Convolutional Network

Arxiv

9+阅读 · 2021年3月26日

End-to-End Video Instance Segmentation with Transformers

Arxiv

10+阅读 · 2021年3月24日

Dynamic Region-Aware Convolution

Arxiv

7+阅读 · 2021年3月15日

Temporal Relational Modeling with Self-Supervision for Action Segmentation

Arxiv

13+阅读 · 2020年12月14日

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Arxiv

8+阅读 · 2020年12月7日

Speech2Action: Cross-modal Supervision for Action Recognition

Speech2Action: Cross-modal Supervision for Action Recognition

Arxiv

7+阅读 · 2020年3月30日

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation

Arxiv

3+阅读 · 2019年3月6日

微信扫码咨询专知VIP会员