In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically relies on complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without bells and whistles, our GIT establishes new state-of-the-art results on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new generation-based scheme for image classification and scene text recognition, achieving decent performance on standard benchmarks.
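To make the "one image encoder, one text decoder, one language-modeling task" design concrete, the following is a minimal PyTorch sketch. It is illustrative only: the class name `GITSketch`, the feature dimension of 1024, the vocabulary size, and the layer counts are assumptions, and the text decoder here cross-attends to projected image features rather than reproducing the paper's exact attention scheme over concatenated image and text tokens.

```python
import torch
import torch.nn as nn

class GITSketch(nn.Module):
    """Hypothetical sketch: one image encoder (stand-in) feeding one text
    decoder, trained with a plain next-token language-modeling loss.
    Sizes and internals are illustrative, not the paper's configuration."""

    def __init__(self, vocab_size=30522, d_model=768, n_layers=6, n_heads=12):
        super().__init__()
        # Image encoder stand-in: the paper uses a pretrained vision
        # transformer; here a linear projection of pre-extracted patch
        # features keeps the example self-contained.
        self.image_proj = nn.Linear(1024, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, input_ids):
        # patch_feats: (B, num_patches, 1024) image features
        # input_ids:   (B, seq_len) caption tokens shifted right
        memory = self.image_proj(patch_feats)
        tgt = self.token_emb(input_ids)
        # Causal mask: each text position attends only to earlier tokens.
        seq_len = input_ids.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)  # next-token logits

# Training reduces to standard cross-entropy over the next token:
model = GITSketch()
patch_feats = torch.randn(2, 196, 1024)
input_ids = torch.randint(0, 30522, (2, 20))
labels = torch.randint(0, 30522, (2, 20))
logits = model(patch_feats, input_ids)
loss = nn.CrossEntropyLoss()(
    logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```

Under this view, captioning, VQA (question as a text prefix), and even classification or scene text recognition (labels generated as text) all reduce to the same decoding objective, which is the unification the abstract refers to.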