Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of large, public ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven very effective at layout detection and segmentation. While these datasets are large enough to train such models, they severely lack layout variability because they are sourced only from scientific article repositories such as PubMed and arXiv. Consequently, the accuracy of layout segmentation drops significantly when these models are applied to more challenging and diverse layouts. In this paper, we present \textit{DocLayNet}, a new, publicly available document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding boxes, each assigned one of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10\% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
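To make the annotation layout concrete, the following is a minimal sketch of what labelled bounding boxes in COCO format look like and how they might be tallied per class. The record is hand-written for illustration: the file name, page size, box coordinates, and the two class names ("Text", "Table") are assumptions and are not taken from the dataset. The helper names `boxes_per_class` and `xywh_to_xyxy` are likewise hypothetical. The only convention relied on is the standard COCO one, in which `bbox` is stored as `[x, y, width, height]`.

```python
from collections import Counter

# Hypothetical COCO-style record; all values are illustrative only.
coco = {
    "images": [
        {"id": 1, "file_name": "page_0001.png", "width": 1000, "height": 1000}
    ],
    "categories": [
        {"id": 1, "name": "Text"},   # assumed class name
        {"id": 2, "name": "Table"},  # assumed class name
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50.0, 40.0, 400.0, 120.0]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50.0, 180.0, 400.0, 90.0]},
        {"id": 12, "image_id": 1, "category_id": 2, "bbox": [50.0, 300.0, 400.0, 250.0]},
    ],
}


def boxes_per_class(dataset):
    """Count annotated bounding boxes per class name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    return Counter(names[a["category_id"]] for a in dataset["annotations"])


def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


if __name__ == "__main__":
    for name, count in sorted(boxes_per_class(coco).items()):
        print(name, count)
    print(xywh_to_xyxy(coco["annotations"][0]["bbox"]))
```

On a real annotation file, the same two helpers would be applied to the dictionary obtained from `json.load`, under the assumption that the file follows the standard COCO object-detection layout.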