One of the most pressing problems in the automated analysis of historical documents is the scarcity of annotated training data. In this paper, we propose a novel method for synthesizing training data for semantic segmentation of document images. We exploit clusters found in the intermediate features of a StyleGAN generator to synthesize RGB and label images simultaneously. Our model can be applied to any dataset of scanned documents without manual annotation of individual images, as each model is custom-fit to its dataset. In our experiments, we show that models trained on our synthetic data reach competitive performance on open benchmark datasets for line segmentation.
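The core idea of deriving label maps from generator features can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the feature tensor here is random stand-in data rather than actual StyleGAN activations, the clustering is a plain NumPy k-means, and all shapes and names (`kmeans_labels`, a 16×16 feature map upsampled to 256×256) are illustrative assumptions.

```python
import numpy as np

def kmeans_labels(features, k, iters=10, seed=0):
    """Cluster per-pixel feature vectors into k groups with plain k-means.

    features: (H, W, C) array of intermediate generator activations.
    Returns an (H, W) integer label map, one cluster id per spatial position.
    """
    h, w, c = features.shape
    x = features.reshape(-1, c)
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen pixel features.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance of every pixel feature to every center.
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = x[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return assign.reshape(h, w)

# Hypothetical stand-in for an intermediate feature map (16x16 spatial, 64 channels);
# a real pipeline would take this from a StyleGAN synthesis layer.
rng = np.random.default_rng(1)
feats = rng.normal(size=(16, 16, 64))
labels = kmeans_labels(feats, k=4)

# Nearest-neighbour upsample so the label map aligns with a 256x256 RGB output.
full = labels.repeat(16, axis=0).repeat(16, axis=1)
```

Because the cluster ids are computed from the same features that produce the RGB image, the label map is spatially aligned with the synthesized image by construction; mapping cluster ids to semantic classes (e.g. text line vs. background) would still require a one-time manual assignment per trained model.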