TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

From arXiv. Accepted to ACM MM 2020. Code is available at https://davar-lab.github.io/publication.html or https://github.com/hikopensource/DAVAR-Lab-OCR

Since ubiquitous real-world documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks: (1) text reading, which detects and recognizes text in images, and (2) information extraction, which analyzes and extracts key elements from the previously extracted plain text. However, they mainly focus on improving the information extraction task and neglect the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network in which the two tasks reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and, in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
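The abstract's two central ideas, fusing visual and textual features for extraction and optimizing both tasks together, can be illustrated with a toy sketch. This is not the authors' architecture: the `TextRegion` type, the concatenation-based `fuse`, and the weighted-sum `joint_loss` are all hypothetical stand-ins chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRegion:
    """One detected text instance; both fields are hypothetical placeholders."""
    visual: List[float]   # e.g. pooled image features of the text box
    textual: List[float]  # e.g. an embedding of the recognized characters


def fuse(region: TextRegion) -> List[float]:
    """Fuse the two modalities by plain concatenation (an assumption)."""
    return region.visual + region.textual


def joint_loss(reading_loss: float, extraction_loss: float,
               weight: float = 1.0) -> float:
    """Combine both task losses into one objective (assumed weighted sum).

    In a real end-to-end model, minimizing a single combined loss is what
    lets extraction errors influence the text-reading parameters.
    """
    return reading_loss + weight * extraction_loss


region = TextRegion(visual=[0.2, 0.7], textual=[0.1, 0.9, 0.4])
fused = fuse(region)
total = joint_loss(reading_loss=0.5, extraction_loss=0.25, weight=2.0)
```

In a decoupled pipeline, by contrast, the extraction stage would see only the recognized plain text, and each stage would be trained on its own loss.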

Related content

Information extraction

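Information extraction (IE) structures the information contained in text into a table-like form. The input to an IE system is raw text, and the output is a set of information points in a fixed format. These information points are extracted from a wide variety of documents and then integrated in a uniform form; this is the main task of information extraction. Integrating information in a uniform form makes it easy to inspect and compare. IE does not attempt to understand an entire document; it analyzes only the parts that contain relevant information. Which information counts as relevant is determined by the domain scope fixed when the system is designed.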
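The raw-text-in, fixed-format-out contract described above can be sketched with a small rule-based extractor. The toy invoice domain, the field names, and the regular expressions in `FIELD_PATTERNS` are assumptions made for illustration; practical IE systems generally rely on learned models rather than hand-written patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical target schema for a toy "invoice" domain. The domain scope,
# fixed at design time, decides which fields count as relevant.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\w+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?(\d+(?:\.\d{2})?)", re.IGNORECASE),
}


def extract(text: str) -> Dict[str, Optional[str]]:
    """Map raw text to a fixed-format record; missing fields become None."""
    record: Dict[str, Optional[str]] = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


raw = ("ACME Corp. Invoice No: A1024 issued to Jane. "
       "Date: 2020-05-19. Thank you! Total: $42.50")
record = extract(raw)
partial = extract("Date: 2021-01-01, no other details available.")
```

Every document yields a record with the same keys, which is what makes the results easy to inspect and compare. Text outside the patterns, such as the greeting, is ignored rather than interpreted.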

[ACL 2020] Span-ConveRT: Few-shot Span Extraction for Dialog with Pretrained Conversational Representations

May 19, 2020

[Keyword extraction, 2020] Keywords extraction with deep neural network model

May 2, 2020

[ACL 2020, Google] BLEURT: Learning Robust Metrics for Text Generation

April 10, 2020

A collection of 100+ recent papers on self-supervised learning

March 18, 2020