Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone - Zhuanzhi (专知) Papers


Visual question answering · Image captioning · Multimodal · Comprehensibility · Transcripts

June 15, 2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone


Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang

From arXiv. Project website: https://ashkamath.github.io/FIBER_page

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
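The idea of inserting cross-attention into the image and text backbones, rather than stacking separate fusion layers on top of them, can be illustrated with a minimal NumPy sketch. This is not the released FIBER code: the function names, the token sizes, the scalar `gate` on the cross-attention branch, and the omission of learned projections, normalization, and feed-forward layers are all simplifying assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def fused_block(x, other, gate):
    """One backbone block with cross-attention inserted after self-attention.

    x:     (n, d) tokens of this modality
    other: (m, d) tokens of the other modality
    gate:  scalar weight on the cross-attention branch (illustrative assumption)
    """
    x = x + attention(x, x, x)                 # uni-modal self-attention
    x = x + gate * attention(x, other, other)  # cross-modal fusion inside the block
    return x


rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(49, 16))  # e.g. 7x7 patches, 16-dim (made-up sizes)
text_tokens = rng.normal(size=(12, 16))   # 12 word tokens (made-up size)

# Each backbone attends to the other modality within the same block.
fused_image = fused_block(image_tokens, text_tokens, gate=0.5)
fused_text = fused_block(text_tokens, image_tokens, gate=0.5)

# With gate = 0 the block reduces to the plain uni-modal block.
unimodal_image = fused_block(image_tokens, text_tokens, gate=0.0)
```

In this sketch, setting `gate` to zero switches fusion off, so the same block can act as a plain uni-modal encoder or as a fused encoder. That is one way to picture a single architecture serving both image-level and region-level tasks.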


Related content

Visual question answering

Visual question answering (VQA) is a learning task that involves both computer vision and natural language processing. The task is defined as follows: "A VQA system takes as input an image and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as the output" [1]. Put simply, VQA means answering questions about a given image.


Related VIP content

Not to be missed! "100 Lectures on Machine Learning" course, taught by Mark Schmidt of UBC

Zhuanzhi Member Services

76+ reads · June 28, 2022

The most complete collection yet! 358 survey papers on machine learning and natural language processing, all in one place

Zhuanzhi Member Services

129+ reads · July 18, 2020

Latest collection of 100+ papers on self-supervised learning

Zhuanzhi Member Services

166+ reads · March 18, 2020

"Table Detection, Information Extraction and Structuring using Deep Learning" by Vihar Kurama

Zhuanzhi Member Services

38+ reads · January 23, 2020

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Zhuanzhi Member Services

49+ reads · October 17, 2019

Stabilizing Transformers for Reinforcement Learning

Zhuanzhi Member Services

60+ reads · October 17, 2019

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Zhuanzhi Member Services

59+ reads · October 17, 2019

[Survey] Scene text detection and recognition with deep learning

Zhuanzhi Member Services

78+ reads · October 10, 2019

[Carnegie Mellon University] Deep learning for computer vision: methods, interpretation, causality, and fairness

Zhuanzhi Member Services

83+ reads · October 9, 2019

[Harvard Business School course, Fall 2019] Machine learning interpretability

Zhuanzhi Member Services

105+ reads · October 9, 2019

Related news

VCIP 2022 Call for Special Session Proposals

CCF Technical Committee on Multimedia

1+ reads · April 1, 2022

AIART 2022 Call for Papers

CCF Technical Committee on Multimedia

1+ reads · February 13, 2022

[ICIG2021] Latest News & Announcements of the Tutorial

China Society of Image and Graphics (CSIG)

3+ reads · December 20, 2021

[ICIG2021] Check out the hot new trailer of ICIG2021 Symposium8

China Society of Image and Graphics (CSIG)

0+ reads · November 16, 2021

[ICIG2021] Check out the hot new trailer of ICIG2021 Symposium2

China Society of Image and Graphics (CSIG)

0+ reads · November 8, 2021

[ICIG2021] Latest News & Announcements of the Plenary Talk1

China Society of Image and Graphics (CSIG)

0+ reads · November 1, 2021

Hierarchically Structured Meta-learning

CreateAMind

27+ reads · May 22, 2019

Transferring Knowledge across Learning Processes

CreateAMind

29+ reads · May 18, 2019

Unsupervised Learning via Meta-Learning

CreateAMind

43+ reads · January 3, 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+ reads · December 24, 2018

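Related grants

Synthesis and structure-activity relationship study of rhodomyrtosone B analogues with anti-MRSA activity

National Natural Science Foundation of China

0+ reads · December 31, 2015

Biosynthetic pathway and metabolic regulation of hypocrellin in the endophytic fungus Shiraia sp. Slf14

National Natural Science Foundation of China

0+ reads · December 31, 2014

Design and synthesis of gem-difluoro-substituted combretastatin derivatives

National Natural Science Foundation of China

0+ reads · December 31, 2014

Key technologies for large-scale partial near-duplicate image retrieval

National Natural Science Foundation of China

0+ reads · December 31, 2013

Near-ultraviolet electroluminescent devices based on a single ZnO nanowire

National Natural Science Foundation of China

0+ reads · December 31, 2013

InSAR three-dimensional deformation estimation accounting for the spatiotemporal correlation of observations

National Natural Science Foundation of China

0+ reads · December 31, 2013

High-precision roll angle measurement based on laser collimation with beam drift compensation

National Natural Science Foundation of China

0+ reads · December 31, 2013

Construction of a knowledge base for the biosynthesis and biotransformation of microbial drugs

National Natural Science Foundation of China

2+ reads · December 31, 2012

Mechanism by which the TRAF6 signaling complex activates different NF-κB pathways during directed differentiation of hMSCs into sweat gland cells

National Natural Science Foundation of China

0+ reads · December 31, 2011

SAR image noise suppression and segmentation based on the Sparse-Land model

National Natural Science Foundation of China

0+ reads · December 31, 2009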

Related papers

PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion

arXiv

0+ reads · August 3, 2022

In Defense of Image Pre-Training for Spatiotemporal Recognition

arXiv

0+ reads · August 1, 2022

DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning

arXiv

1+ reads · August 1, 2022

GraphMFT: A Graph Attention based Multimodal Fusion Technique for Emotion Recognition in Conversation

arXiv

0+ reads · July 31, 2022

A Survey on Vision Transformer

arXiv

17+ reads · February 23, 2022

Transformers in Medical Imaging: A Survey

arXiv

15+ reads · January 24, 2022

Pre-training Methods in Information Retrieval

arXiv

16+ reads · November 27, 2021

A Survey of Visual Transformers

arXiv

39+ reads · November 11, 2021

A Survey of Knowledge-Enhanced Text Generation

arXiv

18+ reads · October 9, 2020

Object Detection in Optical Remote Sensing Images: A Survey and A New Benchmark

arXiv

46+ reads · September 22, 2019


