One of the most pressing problems in the automated analysis of historical documents is the scarcity of annotated training data. Labeling samples is time-consuming because it requires human expertise and therefore cannot be automated well. In this work, we propose a novel method for constructing synthetic labeled datasets for historical documents for which no annotations are available. We train a StyleGAN model to synthesize document images that capture the core features of the original documents. Although the StyleGAN architecture was not originally intended to produce labels, it indirectly learns the underlying semantics in order to generate realistic images. Our approach extracts this semantic information from the intermediate feature maps and uses it to generate ground-truth labels. To investigate whether our synthetic dataset can be used to segment the text in historical documents, we use it to train multiple supervised segmentation models and evaluate their performance. We also train these models on a dataset created by a state-of-the-art synthesis approach and show that the models trained on our dataset achieve better results while requiring even less human annotation effort.
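The abstract does not say how labels are derived from the intermediate feature maps, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes that feature maps from several generator layers are upsampled to a common resolution, concatenated per pixel, and grouped with a plain k-means clustering to obtain a label mask. All function names are hypothetical, and the input arrays stand in for real StyleGAN activations, which this sketch does not compute.

```python
import numpy as np


def upsample_nearest(fmap, size):
    """Nearest-neighbour upsample a (C, h, w) map to (C, size, size).

    Assumes `size` is an integer multiple of both h and w.
    """
    _, h, w = fmap.shape
    return fmap.repeat(size // h, axis=1).repeat(size // w, axis=2)


def pixel_features(fmaps, size):
    """Stack maps from several layers into a (size*size, total_channels) matrix."""
    stacked = np.concatenate([upsample_nearest(f, size) for f in fmaps], axis=0)
    return stacked.reshape(stacked.shape[0], -1).T


def kmeans(x, k, iters=20, seed=0):
    """Minimal k-means; returns one cluster index per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


def feature_maps_to_mask(fmaps, size, k=2):
    """Assumed labeling step: cluster per-pixel features into k classes."""
    labels = kmeans(pixel_features(fmaps, size), k)
    return labels.reshape(size, size)
```

Calling `feature_maps_to_mask([f_low, f_high], size=8, k=2)` with arrays of shape `(C, 4, 4)` and `(C, 8, 8)` returns an 8×8 integer mask. The cluster indices are arbitrary, so mapping them to classes such as text and background would still require a small amount of manual inspection.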