We present document domain randomization (DDR), the first successful transfer of convolutional neural networks (CNNs) trained only on graphically rendered pseudo-paper pages to real-world document segmentation. DDR renders pseudo-document pages by modeling randomized textual and non-textual content of interest, with user-defined layouts and font styles to support joint learning of fine-grained classes. We demonstrate competitive results using DDR to extract nine document classes from the benchmark CS-150 dataset and from papers published in two venues: the annual meetings of the Association for Computational Linguistics (ACL) and IEEE Visualization (VIS). We compare DDR to conditions of style mismatch and to fewer or noisier training samples, which are more easily obtained in the real world. We show that high-fidelity semantic information is not necessary to label semantic classes, but that style mismatch between training and test data can lower model accuracy. Using smaller training sets had a slightly detrimental effect. Finally, network models still achieved high test accuracy when correct labels were diluted toward confusing labels; this behavior holds across several classes.
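The core of DDR is generating pseudo-pages whose elements carry ground-truth semantic labels by construction. The following is a minimal sketch of that idea, not the paper's actual renderer: the class names, font list, and page geometry are illustrative assumptions, and a real pipeline would additionally rasterize text and graphics into each box.

```python
import random

PAGE_W, PAGE_H = 612, 792  # US Letter page at 72 dpi (assumed geometry)

# Illustrative stand-ins for the nine document classes in the paper
CLASSES = ["title", "author", "abstract", "body", "caption",
           "figure", "table", "equation", "reference"]

FONTS = ["Times", "Helvetica", "Computer Modern"]  # user-defined styles


def random_page(seed=None):
    """Return a randomized pseudo-page layout as (class, font, bbox) tuples.

    Boxes are stacked top to bottom with randomized heights, gaps, and
    margins, so every rendered element comes with a free ground-truth
    semantic label -- the property DDR exploits for training.
    """
    rng = random.Random(seed)
    margin = rng.randint(36, 72)          # randomized page margin
    y = margin
    layout = []
    # Title and author lead the page; other classes are sampled per page.
    order = ["title", "author"] + rng.sample(CLASSES[2:], k=5)
    for cls in order:
        h = rng.randint(20, 120)          # randomized block height
        if y + h > PAGE_H - margin:       # stop when the page is full
            break
        bbox = (margin, y, PAGE_W - margin, y + h)
        layout.append((cls, rng.choice(FONTS), bbox))
        y += h + rng.randint(6, 24)       # randomized inter-block gap
    return layout
```

A training set would be built by calling `random_page` with many seeds and rendering each layout; because labels come from the generator rather than annotation, noisy or mismatched real-world labels are avoided entirely.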