Forms are a widespread type of template-based document used in a great variety of fields, including administration, medicine, finance, and insurance. Automatic extraction of the information contained in these documents is in great demand because of the increasing volume of forms generated on a daily basis. However, this is not a straightforward task when working with scanned forms, both because of the great diversity of templates, which place form entities in different locations, and because of the varying quality of the scanned documents. In this context, there is one feature shared by all forms: they contain a collection of interlinked entities built as key-value (or label-value) pairs, together with other entities such as headers or images. In this work, we have tackled the problem of entity linking in forms by combining image processing techniques and a text classification model based on the BERT architecture. This approach achieves state-of-the-art results with an F1-score of 0.80 on the FUNSD dataset, a 5% improvement over the best previous method. The code of this project is available at https://github.com/mavillot/FUNSD-Entity-Linking.