Large pre-trained language models achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, they almost exclusively focus on text-only representations, neglecting the cell-level layout information that is important for form image understanding. In this paper, we propose a new pre-training approach, StructuralLM, that jointly leverages cell and layout information from scanned documents. Specifically, we pre-train StructuralLM with two new designs that exploit the interactions between cell and layout information: 1) treating each cell as a semantic unit; 2) classification of cell positions. The pre-trained StructuralLM achieves new state-of-the-art results on different types of downstream tasks, including form understanding (F1 from 78.95 to 85.14), document visual question answering (ANLS from 72.59 to 83.94), and document image classification (accuracy from 94.43 to 96.08).
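To make the two designs concrete, the sketch below illustrates them in plain Python under stated assumptions: the `Cell` data structure, helper names, and the 0-1000 coordinate normalization are illustrative choices made here (common in layout-aware pre-training), not the authors' released code. It shows how every token in a cell can share a single cell-level 2D position, and how a cell-position classification target can be derived by dividing the page into an N x N grid of areas.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Cell:
    # A cell as produced by OCR: its tokens plus one bounding box.
    tokens: List[str]
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1), normalized to [0, 1000]

def cell_level_positions(cells: List[Cell]) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Design 1 (each cell as a semantic unit): every token in a cell
    is paired with the cell's 2D box rather than a word-level box."""
    pairs = []
    for cell in cells:
        for tok in cell.tokens:
            pairs.append((tok, cell.bbox))
    return pairs

def cell_position_label(cell: Cell, num_areas: int = 4) -> int:
    """Design 2 (cell position classification): split the page into
    num_areas x num_areas regions and label the cell by the region
    containing its center; this index serves as a classification target."""
    x0, y0, x1, y1 = cell.bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    col = min(int(cx * num_areas / 1000), num_areas - 1)
    row = min(int(cy * num_areas / 1000), num_areas - 1)
    return row * num_areas + col

if __name__ == "__main__":
    cell = Cell(tokens=["Invoice", "Date:"], bbox=(120, 40, 360, 70))
    print(cell_level_positions([cell]))  # both tokens share the cell's box
    print(cell_position_label(cell))     # region index in the N x N grid
```

During pre-training, the region index returned by `cell_position_label` can be predicted for cells whose positions are masked, forcing the model to learn where a cell sits on the page from its textual and layout context.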