LAMBERT: Layout-Aware (Language) Modeling for information extraction

From arXiv. v1: 9 pages; work in progress; this version of the paper was submitted to review on Dec 10, 2019, and subsequently withdrawn on Feb 17, 2020. v2: 17 pages. v3: 18 pages, 2 appendices. v4: 15 pages.

We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn the language semantics from scratch. We augment the input of the model only with the coordinates of token bounding boxes, avoiding the use of raw images. This leads to a layout-aware language model which can then be fine-tuned on downstream tasks. The model is evaluated on an end-to-end information extraction task using four publicly available datasets: Kleister NDA, Kleister Charity, SROIE and CORD. We show that it achieves superior performance on datasets consisting of visually rich documents, while also outperforming the RoBERTa baseline on documents with flat layout (NDA F1 increase from 78.50 to 80.42). Our solution ranked 1st on the public leaderboard for Key Information Extraction from the SROIE dataset, improving the SOTA F1-score from 97.81 to 98.17.
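The idea of augmenting the model input with token bounding-box coordinates can be illustrated with a minimal sketch. The abstract does not specify how the coordinates are embedded, so everything here is an assumption made for illustration: the function names `layout_embedding` and `add_layout`, the sinusoidal encoding of coordinates normalized to [0, 1], the linear projection `proj`, and the toy dimensions. It is not the authors' implementation.

```python
import numpy as np


def layout_embedding(bboxes, dim):
    """Sinusoidal embedding of boxes (x1, y1, x2, y2), each coordinate in [0, 1].

    Hypothetical helper: `dim` must be divisible by 8
    (4 coordinates x sin/cos x dim // 8 frequencies).
    """
    bboxes = np.asarray(bboxes, dtype=np.float64)            # (n_tokens, 4)
    n_freq = dim // 8
    freqs = 2.0 ** np.arange(n_freq)                         # (n_freq,)
    angles = bboxes[:, :, None] * freqs[None, None, :] * np.pi
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(len(bboxes), -1)                      # (n_tokens, dim)


def add_layout(token_emb, bboxes, proj):
    """Add a projected layout embedding to (pretrained) token embeddings.

    `proj` has shape (layout_dim, hidden_dim). This additive form is an
    assumption of the sketch, not something the abstract states.
    """
    lay = layout_embedding(bboxes, proj.shape[0])
    return token_emb + lay @ proj


# Toy usage: two tokens on the same text line, hidden size 16, layout size 8.
tokens = np.ones((2, 16))
boxes = [[0.10, 0.20, 0.30, 0.25], [0.50, 0.20, 0.70, 0.25]]
rng = np.random.default_rng(0)
out = add_layout(tokens, boxes, rng.normal(size=(8, 16)))
print(out.shape)  # (2, 16)
```

With an all-zero `proj`, `add_layout` returns the token embeddings unchanged, which is one way (again, an illustrative choice) to start from a pretrained text-only model without disturbing what it has already learned.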


Related content

Information extraction


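Information extraction (IE) structures the information contained in text into a table-like form. The input to an information extraction system is raw text, and the output is a set of information points in a fixed format. These information points are extracted from a wide variety of documents and then integrated in a uniform form; this is the main task of information extraction. The benefit of a uniform form is that it makes the information easy to inspect and compare. Information extraction does not attempt to understand the whole document; it only analyzes the parts that contain relevant information. Which information counts as relevant is determined by the domain scope fixed when the system is designed.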
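The raw-text-in, fixed-format-out description above can be sketched with a toy rule-based extractor. The field names (`date`, `total`), the regular expressions, and the sample sentences are all made up for illustration; practical systems, including learned models, are far more involved.

```python
import re

# Hypothetical target schema: one pattern per field of interest.
FIELDS = {
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "total": re.compile(r"\btotal\s*[:=]?\s*(\d+(?:\.\d{2})?)", re.IGNORECASE),
}


def extract(text):
    """Return a fixed-format record; fields not found are set to None."""
    record = {}
    for name, pattern in FIELDS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


# Differently worded documents end up in the same uniform structure,
# which is what makes them easy to inspect and compare.
docs = [
    "Invoice dated 2020-02-17. Total: 42.50 EUR",
    "TOTAL 7 paid on 2019-12-10",
    "No relevant fields here.",
]
table = [extract(d) for d in docs]
for row in table:
    print(row)
```

Only the parts of each text that match a field pattern are analyzed; the rest of the document is ignored, in line with the point that IE does not try to understand the whole document.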
