Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decouple this problem into two sub-tasks: entity labeling and entity linking, which require a comprehensive understanding of the document context at both the token and segment levels. However, little work has focused on solutions that efficiently extract structured data from these different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on the transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at the segment and token levels, and show that it outperforms state-of-the-art counterparts with significantly superior performance on the FUNSD, SROIE, and EPHOIE datasets.
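To make the Paired Boxes Direction pre-training task concrete, the sketch below shows one plausible way such a self-supervised label could be derived from layout alone: the angle between the centers of two text-segment bounding boxes is quantized into a fixed number of direction bins, which a model could then be trained to predict. The function name, the `(x1, y1, x2, y2)` box format, and the choice of 8 bins are illustrative assumptions, not details taken from the paper.

```python
import math


def box_center(box):
    """Center point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def paired_box_direction(box_a, box_b, n_bins=8):
    """Quantize the direction from box_a's center to box_b's center
    into one of n_bins equal angular sectors (bin 0 starts at the
    positive x-axis, proceeding counter-clockwise).

    This is a hypothetical label scheme sketched for illustration;
    the paper only names the task, not its exact formulation.
    """
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    angle = math.atan2(by - ay, bx - ax) % (2.0 * math.pi)
    return int(angle / (2.0 * math.pi / n_bins)) % n_bins


# A box directly to the right falls in bin 0; one directly above
# (in a y-up coordinate frame) falls a quarter turn later.
print(paired_box_direction((0, 0, 2, 2), (10, 0, 12, 2)))   # right neighbor
print(paired_box_direction((0, 0, 2, 2), (0, 10, 2, 12)))   # neighbor above
```

Labels of this kind are cheap to generate from OCR output, which is what makes a direction-prediction objective attractive as a self-supervised signal for layout understanding.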