This paper introduces the DocILE benchmark, which provides the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset was built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, surpassing the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition, a highly practical information extraction task in which key information has to be assigned to items in a table; (iii) documents from numerous layouts, with a test set that includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, and their results are reported in this paper, offering a quick starting point for future work. The dataset and baselines are available at https://github.com/rossumai/docile.