Compared to general document analysis tasks, understanding and retrieving the structure of form documents is particularly challenging. Form documents are typically created by two types of authors: a form designer, who develops the form structure and keys, and a form user, who fills in form values based on the provided keys. Hence, if a form user is confused, the form values may not align with the form designer's intention (structure and keys). In this paper, we introduce Form-NLU, the first dataset for form structure understanding and key-value information extraction, which captures both the form designer's intent and how user-written values align with it. It consists of 857 form images, 6k form keys and values, and 4k table keys and values. The dataset covers three form types (digital, printed, and handwritten), spanning diverse form appearances and layouts. We also propose a robust form key-value information extraction framework based on positional and logical relations. Using Form-NLU, we first examine strong object detection models for form layout understanding, then evaluate the key information extraction task, providing fine-grained results for different types of forms and keys. Furthermore, we test our approach with an off-the-shelf PDF layout extraction tool and demonstrate its feasibility in real-world cases.