We introduce a simple new approach to the problem of understanding documents in which non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture so that it can use layout features obtained from an OCR system, without needing to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token bounding boxes, thereby avoiding the use of raw images. This yields a layout-aware language model that can then be fine-tuned on downstream tasks. The model is evaluated on an end-to-end information extraction task using four publicly available datasets: Kleister NDA, Kleister Charity, SROIE, and CORD. We show that our model achieves superior performance on datasets consisting of visually rich documents, while also outperforming the baseline RoBERTa on documents with flat layout (NDA \(F_{1}\) increase from 78.50 to 80.42). Our solution ranked first on the public leaderboard for Key Information Extraction on the SROIE dataset, improving the SOTA \(F_{1}\)-score from 97.81 to 98.17.
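To make the core idea concrete, below is a minimal sketch of how a Transformer's input embeddings might be augmented with token bounding-box coordinates from OCR. The class name `LayoutAugmentedEmbedding`, the linear projection of coordinates, and all hyperparameters are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class LayoutAugmentedEmbedding(nn.Module):
    """Adds a projection of token bounding-box coordinates to word embeddings.

    Hypothetical sketch: the paper augments the Transformer input with
    bounding boxes from an OCR system; the exact way coordinates are
    embedded may differ from this simple linear projection.
    """

    def __init__(self, vocab_size: int, hidden_size: int, bbox_dim: int = 4):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        # Projects normalized (x0, y0, x1, y1) box coordinates into the
        # same space as the word embeddings, so they can simply be added.
        self.bbox_projection = nn.Linear(bbox_dim, hidden_size)

    def forward(self, input_ids: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len); bboxes: (batch, seq_len, 4) in [0, 1]
        return self.word_embeddings(input_ids) + self.bbox_projection(bboxes)

# The augmented embeddings feed an otherwise unchanged encoder stack,
# which is why pretrained language-model weights can be reused.
emb = LayoutAugmentedEmbedding(vocab_size=50265, hidden_size=768)
ids = torch.randint(0, 50265, (1, 16))     # dummy token ids
boxes = torch.rand(1, 16, 4)               # dummy normalized boxes
hidden = emb(ids, boxes)                   # shape: (1, 16, 768)
```

Since only the input embedding layer changes, fine-tuning proceeds exactly as with a standard RoBERTa-style encoder.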