Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become an active and important research topic due to its widespread commercial value. Most existing methods divide this task into two subtasks: a text reading part that obtains the plain text from the original document images, and an information extraction part that extracts the key contents. These methods mainly focus on improving the second part, while neglecting that the two parts are highly correlated. This paper proposes a unified end-to-end information extraction framework for visually rich documents, in which text reading and information extraction reinforce each other via a well-designed multi-modal context block. Specifically, the text reading part provides multi-modal features: visual, textual, and layout features. The multi-modal context block fuses these generated features, and can further incorporate prior knowledge from a pre-trained language model, to produce a better semantic representation. The information extraction part then generates the key contents from the fused context features. The framework can be trained end-to-end, achieving global optimization. Moreover, we define and group visually rich documents into four categories along two dimensions: layout and text type. For each document category, we provide or recommend the corresponding benchmarks, experimental settings, and strong baselines, to remedy the lack of a uniform evaluation standard in this research area. Extensive experiments on four kinds of benchmarks (from fixed layout to variable layout, from fully structured text to semi-unstructured text) are reported, demonstrating the proposed method's effectiveness. Data, source code, and models are available.
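As a rough illustration of the fusion idea only (not the paper's actual architecture), the sketch below projects per-region visual, textual, and layout features into a shared space and combines them additively. All dimensions, names, and the additive fusion choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions for one detected text region.
D_VIS, D_TXT, D_LAY, D_CTX = 256, 300, 4, 128

def fuse_context(visual, textual, layout, weights):
    """Fuse per-region features from three modalities by projecting each
    into a shared context space and summing (one simple fusion choice)."""
    fused = (visual @ weights["vis"]
             + textual @ weights["txt"]
             + layout @ weights["lay"])
    # A non-linearity stands in for the paper's richer context block.
    return np.tanh(fused)

# Randomly initialised projection matrices (training is out of scope here).
weights = {
    "vis": rng.standard_normal((D_VIS, D_CTX)) * 0.01,
    "txt": rng.standard_normal((D_TXT, D_CTX)) * 0.01,
    "lay": rng.standard_normal((D_LAY, D_CTX)) * 0.01,
}

n_regions = 5  # e.g. five detected text lines on a ticket
visual = rng.standard_normal((n_regions, D_VIS))   # from the image branch
textual = rng.standard_normal((n_regions, D_TXT))  # from the text branch
layout = rng.standard_normal((n_regions, D_LAY))   # e.g. box coordinates

context = fuse_context(visual, textual, layout, weights)
print(context.shape)  # (5, 128)
```

The fused `context` features would then feed the information extraction decoder; in the actual framework the fusion is learned jointly with both subtasks rather than fixed as here.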