DocFormer: 文档理解端到端转换器 (DocFormer: End-to-End Transformer for Document Understanding) - 专知论文

会员服务 ·

0

可理解性 · 变换 · 端到端 · state-of-the-art · MoDELS ·

2021 年 9 月 20 日

DocFormer: End-to-End Transformer for Document Understanding

翻译：DocFormer: 文档理解端到端转换器

Srikar Appalaraju,Bhavan Jasani,Bhargava Urala Kota,Yusheng Xie,R. Manmatha

from arxiv, Accepted to ICCV 2021 main conference

We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).

翻译：我们介绍了基于多式变压器的DocFormer(一个基于多式变压器的架构),用以完成视觉文档理解(VDU)的任务。VDU是一个具有挑战性的问题,目的是以不同格式(形式、收据等)和布局来理解文件。此外,DocFormer(DocFormer)使用精心设计、鼓励多式互动的任务,在没有监督的情况下接受了预先培训。DocFormer(DocFormer)使用文字、视觉和空间特征,并使用新型的多式自省层进行合并。DocFormer(DocFormer)还分享了各种模式之间的空间嵌入,使模型更容易将文字与视觉符号联系起来,反之亦然。DocFormer(DocFormer)在4个不同的数据集上进行了评估,每个数据集都有很强的基线。DocFormer(DocFormer)在所有这些数据集上都取得了最新的结果,有时击打4x 其大小(没有参数)。

0

相关内容

可理解性

人脸活体检测综述

专知会员服务

34+阅读 · 2021年9月8日

小目标检测研究进展

专知会员服务

43+阅读 · 2021年7月10日

《神经架构搜索NAS》最新进展综述

《神经架构搜索NAS》最新进展综述

专知会员服务

57+阅读 · 2020年8月12日

最新《自动机器学习》综述论文，AutoML: A Survey of the State-of-the-Art

最新《自动机器学习》综述论文，AutoML: A Survey of the State-of-the-Art

专知会员服务

93+阅读 · 2020年7月10日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

ExBert — 可视化分析Transformer学到的表示

ExBert — 可视化分析Transformer学到的表示

专知会员服务

32+阅读 · 2019年10月16日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

最新BERT相关论文清单，BERT-related Papers

最新BERT相关论文清单，BERT-related Papers

专知会员服务

53+阅读 · 2019年9月29日

LibRec 精选：AutoML for Contextual Bandits

LibRec 精选：AutoML for Contextual Bandits

LibRec智能推荐

7+阅读 · 2019年9月19日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

【推荐】视频目标分割基础

【推荐】视频目标分割基础

机器学习研究会

9+阅读 · 2017年9月19日

【推荐】深度学习目标检测全面综述

【推荐】深度学习目标检测全面综述

机器学习研究会

21+阅读 · 2017年9月13日

【推荐】全卷积语义分割综述

【推荐】全卷积语义分割综述

机器学习研究会

19+阅读 · 2017年8月31日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

Improving Large-scale Language Models and Resources for Filipino

Arxiv

0+阅读 · 2021年11月11日

Multimodal Approach for Metadata Extraction from German Scientific Publications

Arxiv

0+阅读 · 2021年11月10日

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Arxiv

13+阅读 · 2021年4月7日

End-to-End Video Instance Segmentation with Transformers

Arxiv

10+阅读 · 2021年3月24日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

Unified Vision-Language Pre-Training for Image Captioning and VQA

Unified Vision-Language Pre-Training for Image Captioning and VQA

Arxiv

8+阅读 · 2019年10月3日

How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations

Arxiv

4+阅读 · 2019年9月11日

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Arxiv

15+阅读 · 2018年10月11日

End-to-End Text Classification via Image-based Embedding using Character-level Networks

End-to-End Text Classification via Image-based Embedding using Character-level Networks

Arxiv

5+阅读 · 2018年10月10日

VIP会员

文章信息

相关主题

state-of-the-art

相关VIP内容

人脸活体检测综述

专知会员服务

34+阅读 · 2021年9月8日

小目标检测研究进展

专知会员服务

43+阅读 · 2021年7月10日

《神经架构搜索NAS》最新进展综述

《神经架构搜索NAS》最新进展综述

专知会员服务

57+阅读 · 2020年8月12日

最新《自动机器学习》综述论文，AutoML: A Survey of the State-of-the-Art

最新《自动机器学习》综述论文，AutoML: A Survey of the State-of-the-Art

专知会员服务

93+阅读 · 2020年7月10日

Transformer文本分类代码

Transformer文本分类代码

专知会员服务

118+阅读 · 2020年2月3日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

ExBert — 可视化分析Transformer学到的表示

ExBert — 可视化分析Transformer学到的表示

专知会员服务

32+阅读 · 2019年10月16日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

最新BERT相关论文清单，BERT-related Papers

最新BERT相关论文清单，BERT-related Papers

专知会员服务

53+阅读 · 2019年9月29日

热门VIP内容

开通专知VIP会员享更多权益服务

【CMU博士论文】移动计算摄影的神经场表示

大语言模型遇见法律人工智能：综述

【ICCV2025】InfGen：一种分辨率无关的可扩展图像合成范式

美军用无人地面战车发展：现代战争中超越弹药的多元应用

相关资讯

LibRec 精选：AutoML for Contextual Bandits

LibRec 精选：AutoML for Contextual Bandits

LibRec智能推荐

7+阅读 · 2019年9月19日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【论文】图上的表示学习综述

【论文】图上的表示学习综述

机器学习研究会

15+阅读 · 2017年9月24日

【推荐】视频目标分割基础

【推荐】视频目标分割基础

机器学习研究会

9+阅读 · 2017年9月19日

【推荐】深度学习目标检测全面综述

【推荐】深度学习目标检测全面综述

机器学习研究会

21+阅读 · 2017年9月13日

【推荐】全卷积语义分割综述

【推荐】全卷积语义分割综述

机器学习研究会

19+阅读 · 2017年8月31日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

相关论文

Improving Large-scale Language Models and Resources for Filipino

Arxiv

0+阅读 · 2021年11月11日

Multimodal Approach for Metadata Extraction from German Scientific Publications

Arxiv

0+阅读 · 2021年11月10日

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Arxiv

13+阅读 · 2021年4月7日

End-to-End Video Instance Segmentation with Transformers

Arxiv

10+阅读 · 2021年3月24日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

Unified Vision-Language Pre-Training for Image Captioning and VQA

Unified Vision-Language Pre-Training for Image Captioning and VQA

Arxiv

8+阅读 · 2019年10月3日

How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations

Arxiv

4+阅读 · 2019年9月11日

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Arxiv

15+阅读 · 2018年10月11日

End-to-End Text Classification via Image-based Embedding using Character-level Networks

End-to-End Text Classification via Image-based Embedding using Character-level Networks

Arxiv

5+阅读 · 2018年10月10日

微信扫码咨询专知VIP会员