We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent only parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural language. In this paper, we propose SOHO to "See Out of tHe bOx": it takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. The VD is designed to represent consistent visual abstractions of similar semantics; it is updated on-the-fly and utilized in our proposed pre-training task, Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks following standard VLPT settings. SOHO achieves absolute gains of 2.0% R@1 on the MSCOCO text retrieval 5k test split, 1.5% accuracy on the NLVR$^2$ test-P split, and 6.7% accuracy on the SNLI-VE test split.
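To make the visual dictionary idea concrete, the sketch below illustrates one plausible reading of it: grid-level CNN features are quantized to their nearest codebook entry, and the codebook is refreshed on-the-fly with a moving average of the features assigned to each entry. This is a minimal illustrative sketch, not the authors' released implementation; the class name `VisualDictionary`, the codebook size, the momentum value, and the straight-through gradient trick are assumptions made for the example.

```python
import torch


class VisualDictionary(torch.nn.Module):
    """Hedged sketch of a visual dictionary (VD): map CNN grid features to
    compact visual tokens and update the codebook on-the-fly."""

    def __init__(self, num_tokens=2048, dim=2048, momentum=0.99):
        super().__init__()
        self.momentum = momentum
        # Codebook of visual token embeddings (illustrative size/init).
        self.register_buffer("codebook", torch.randn(num_tokens, dim))

    def forward(self, features):
        # features: (N, dim) flattened grid features from a whole image.
        dists = torch.cdist(features, self.codebook)   # (N, num_tokens)
        indices = dists.argmin(dim=1)                   # VD index per grid feature
        quantized = self.codebook[indices]              # compact visual tokens

        if self.training:
            # On-the-fly update: move each used entry toward the mean of
            # the features currently assigned to it.
            with torch.no_grad():
                for idx in indices.unique():
                    assigned = features[indices == idx].mean(dim=0)
                    self.codebook[idx] = (self.momentum * self.codebook[idx]
                                          + (1.0 - self.momentum) * assigned)

        # Straight-through estimator so gradients still reach the CNN backbone
        # (an assumption about how end-to-end training could be kept).
        quantized = features + (quantized - features).detach()
        return quantized, indices
```

A pre-training objective in the spirit of MVM could then operate on the discrete `indices`, e.g., masking grid features and training the model to predict their VD index from the surrounding visual and textual context.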