Text line segmentation is one of the key steps in historical document understanding. It is challenging due to the variety of fonts, contents, and writing styles, as well as the quality of documents that have degraded over the years. In this paper, we address the limitations that currently prevent people from building line segmentation models with a high generalization capacity. We present a study conducted using three state-of-the-art systems, Doc-UFCN, dhSegment, and ARU-Net, and show that it is possible to build generic models, trained on a wide variety of historical document datasets, that can correctly segment diverse unseen pages. This paper also highlights the importance of the annotations used during training: each existing dataset is annotated differently. We present a unification of the annotations and show its positive impact on the final text recognition results. To this end, we present a complete evaluation strategy using standard pixel-level metrics and object-level ones, and introducing goal-oriented metrics.
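As a minimal illustration of the pixel-level evaluation mentioned above, the sketch below computes Intersection over Union (IoU) between a predicted and a ground-truth binary line mask. This is a generic hedged example, not the paper's exact metric implementation; the function name and mask format are assumptions.

```python
import numpy as np

def pixel_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Pixel-level IoU between two binary masks (1 = text line pixel).

    Returns 1.0 when both masks are empty (perfect agreement on background).
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(intersection) / float(union) if union > 0 else 1.0
```

Object-level metrics would instead match predicted line instances to ground-truth lines (e.g. above an IoU threshold) and report precision/recall over matched instances.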