LAMBERT: Layout-Aware (Language) Modeling for information extraction

We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token bounding boxes, avoiding, in this way, the use of raw images. This leads to a layout-aware language model which can then be fine-tuned on downstream tasks. The model is evaluated on an end-to-end information extraction task using four publicly available datasets: Kleister NDA, Kleister Charity, SROIE and CORD. We show that our model achieves superior performance on datasets consisting of visually rich documents, while also outperforming the baseline RoBERTa on documents with flat layout (NDA \(F_{1}\) increase from 78.50 to 80.42). Our solution ranked first on the public leaderboard for the Key Information Extraction from the SROIE dataset, improving the SOTA \(F_{1}\)-score from 97.81 to 98.17.
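The idea of augmenting the model input with token bounding-box coordinates can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how coordinates are embedded, so the example simply assumes that each box is normalized to the page size, projected to the embedding dimension by a linear map, and added to the token embedding. All names (`normalize_bbox`, `make_projection`, `add_layout`) are hypothetical, and the random projection stands in for weights that would be learned in practice.

```python
import random


def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x1, y1, x2, y2) pixel box to the [0, 1] range."""
    x1, y1, x2, y2 = bbox
    return [x1 / page_width, y1 / page_height, x2 / page_width, y2 / page_height]


def make_projection(dim, seed=0):
    """Hypothetical stand-in for a learned 4 -> dim linear layer (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(dim)]


def layout_embedding(norm_bbox, projection):
    """Project the 4 normalized coordinates to the embedding dimension."""
    return [sum(w * c for w, c in zip(row, norm_bbox)) for row in projection]


def add_layout(token_embeddings, bboxes, page_size, projection):
    """Add a layout term to each token embedding; the text input is unchanged."""
    width, height = page_size
    augmented = []
    for emb, bbox in zip(token_embeddings, bboxes):
        layout = layout_embedding(normalize_bbox(bbox, width, height), projection)
        augmented.append([e + l for e, l in zip(emb, layout)])
    return augmented


# Two tokens with 8-dimensional embeddings and OCR boxes on a 1000x1000 page.
token_embeddings = [[0.0] * 8, [1.0] * 8]
bboxes = [(0, 0, 0, 0), (100, 50, 200, 80)]
projection = make_projection(dim=8)
augmented = add_layout(token_embeddings, bboxes, (1000, 1000), projection)
```

Because the layout term is added to existing token embeddings, a pretrained language model's embeddings can be reused as the starting point, which matches the abstract's point about not re-learning language semantics from scratch.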


Related content

Information extraction


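Information extraction (IE) turns the information contained in text into a structured, table-like form. The input to an information extraction system is raw text, and the output is a set of information points in a fixed format. These points are extracted from many kinds of documents and then integrated in a uniform form; this is the main task of information extraction. The benefit of a uniform form is that the information is easy to inspect and compare. Information extraction does not attempt to understand the whole document; it only analyzes the parts that contain relevant information. Which information counts as relevant is determined by the domain scope fixed when the system is designed.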
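The raw-text-in, fixed-format-out contract described above can be sketched with a toy rule-based extractor. This is only an illustration of the input and output shape, not the learned model discussed in the abstract; the field names (`total`, `date`) and the regular expressions are assumptions chosen for a receipt-like example.

```python
import re

# Hypothetical fields and patterns; a real system would define these
# according to its own domain scope.
FIELD_PATTERNS = {
    "total": re.compile(r"\bTOTAL\s*:?\s*(\d+\.\d{2})", re.IGNORECASE),
    "date": re.compile(r"(\d{4}-\d{2}-\d{2})"),
}


def extract(text):
    """Return a fixed-format record; fields that are not found are None."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


receipt = "ACME STORE\nDate: 2019-03-14\nTotal: 42.50"
record = extract(receipt)
```

Every document yields a record with the same keys, which is what makes results from different documents easy to inspect and compare.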
