We present a framework for generating synthetic historical documents with precise ground truth using nothing more than a collection of unlabeled historical images. The difficulty of obtaining large labeled datasets is often the limiting factor in applying supervised deep learning methods to Document Image Analysis (DIA). Prior approaches to synthetic data generation either require expertise or produce synthetic documents with poor accuracy. To achieve high-precision transformations without requiring expertise, we tackle the problem in two steps. First, we create template documents with user-specified content and structure. Second, we transfer the style of a collection of unlabeled historical images to these template documents while preserving their text and layout. We evaluate our synthetic historical documents in a pre-training setting and find that this approach outperforms the baselines (randomly initialized and pre-trained). Additionally, visual examples demonstrate high-quality synthesis, which makes it possible to generate large labeled historical document datasets with precise ground truth.
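As a rough illustration of the first step (creating a template document whose ground truth is known by construction), the sketch below lays out user-specified text on a page and records a bounding box for every word. It is not the authors' implementation: the fixed-width glyph grid, the `WordBox` type, the `layout_template` function, and all of its parameters are assumptions made for this example. The second step, style transfer from unlabeled historical images, is not sketched here.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Ground-truth record for one word on the template page (hypothetical type)."""
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int  # width, in pixels
    h: int  # height, in pixels


def layout_template(text, page_width=400, margin=20,
                    char_w=8, line_h=16, space_w=8):
    """Place words left to right and wrap at the right margin.

    Assumes every glyph is char_w pixels wide, so each word's box is known
    exactly without rendering. Returns one WordBox per word, in reading order.
    """
    boxes = []
    x, y = margin, margin
    for word in text.split():
        w = len(word) * char_w
        # Wrap when the word would cross the right margin, unless it is
        # already the first word on the line.
        if x + w > page_width - margin and x > margin:
            x = margin
            y += line_h
        boxes.append(WordBox(word, x, y, w, line_h))
        x += w + space_w
    return boxes
```

Because the layout is computed rather than annotated, the text and boxes returned by `layout_template` serve directly as labels for the template. A style-transfer stage that preserves text and layout, as the abstract describes, would let the same labels carry over to the stylized image.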