Since ubiquitous real-world documents (e.g., invoices, tickets, resumes, and leaflets) contain rich information, automatic document image understanding has become a hot research topic. Most existing works decouple the problem into two separate tasks: (1) text reading, which detects and recognizes text in images, and (2) information extraction, which analyzes and extracts key elements from the recognized plain text. However, these works mainly focus on improving the information extraction task while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network in which the two tasks reinforce each other. Specifically, the multimodal visual and textual features from text reading are fused for information extraction, and in turn, the semantics learned in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed to variable layouts, from structured to semi-structured text), our proposed method significantly outperforms state-of-the-art methods in both efficiency and accuracy.
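To make the fusion idea concrete, below is a minimal PyTorch sketch of how per-region visual features from a text detector might be combined with textual features from a recognizer before entity classification. The module names, feature dimensions, fusion strategy (concatenation plus a linear projection), and the BiLSTM context encoder are all illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FusionExtractor(nn.Module):
    """Hypothetical fusion head: combines visual and textual features
    of detected text regions and classifies each region into an
    entity category (e.g., invoice number, date, total)."""

    def __init__(self, visual_dim=256, text_dim=128, hidden_dim=256, num_labels=5):
        super().__init__()
        # Project concatenated visual + textual features into a shared space.
        self.fuse = nn.Linear(visual_dim + text_dim, hidden_dim)
        # Lightweight context encoder over all text regions in one document;
        # bidirectional halves keep the output size equal to hidden_dim.
        self.context = nn.LSTM(hidden_dim, hidden_dim // 2,
                               bidirectional=True, batch_first=True)
        # Per-region entity classifier.
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (batch, num_regions, visual_dim), pooled from the detector
        # text_feats:   (batch, num_regions, text_dim), summarizing recognized text
        fused = torch.relu(self.fuse(torch.cat([visual_feats, text_feats], dim=-1)))
        context, _ = self.context(fused)
        return self.classifier(context)  # (batch, num_regions, num_labels)

# Usage on dummy inputs:
model = FusionExtractor()
logits = model(torch.randn(2, 10, 256), torch.randn(2, 10, 128))
print(logits.shape)  # torch.Size([2, 10, 5])
```

In an end-to-end network of the kind described above, the extraction loss computed on these logits would also backpropagate into the reading branch that produced the input features, which is one way the two tasks can reinforce each other during training.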