We introduce a simple new approach to the problem of understanding documents in which non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture so that it can use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We augment the input of the model only with the coordinates of token bounding boxes, avoiding the use of raw images. This leads to a layout-aware language model which can then be fine-tuned on downstream tasks. The model is evaluated on an end-to-end information extraction task using four publicly available datasets: Kleister NDA, Kleister Charity, SROIE, and CORD. We show that it achieves superior performance on datasets consisting of visually rich documents, while at the same time outperforming the baseline RoBERTa on documents with a flat layout (NDA F1 increases from 78.50 to 80.42). Our solution ranked first on the public leaderboard for Key Information Extraction from the SROIE dataset, improving the SOTA F1-score from 97.81 to 98.17.
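To make the core idea concrete, the following is a minimal PyTorch-style sketch, not the authors' exact formulation, of how token embeddings could be augmented with OCR-derived bounding-box coordinates before being fed to a Transformer encoder. The class name `LayoutAwareEmbedding`, the single linear projection `layout_proj`, and the normalization of coordinates to [0, 1] are illustrative assumptions.

```python
# Sketch only: augmenting token embeddings with bounding-box coordinates.
import torch
import torch.nn as nn


class LayoutAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int = 768):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        # Each token bounding box is (x1, y1, x2, y2), normalized to [0, 1];
        # a linear layer maps it into the same space as the token embeddings.
        self.layout_proj = nn.Linear(4, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len); bboxes: (batch, seq_len, 4)
        return self.layer_norm(self.token_embedding(input_ids) + self.layout_proj(bboxes))


# Usage example with random data.
emb = LayoutAwareEmbedding(vocab_size=50265)
ids = torch.randint(0, 50265, (2, 16))
boxes = torch.rand(2, 16, 4)
out = emb(ids, boxes)  # (2, 16, 768), ready for a Transformer encoder
```

Because only four extra coordinates per token are injected at the input, the remaining layers of a pretrained encoder such as RoBERTa can in principle be kept intact, which is what allows fine-tuning without re-learning language semantics from scratch.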