Document structure reconstruction is the task of converting digital or scanned documents into their corresponding semantic structures. Most existing work focuses on segmenting the boundaries of individual elements within a single document page and neglects the reconstruction of semantic structure across multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task relevant to both the NLP and CV fields. To better evaluate system performance on the new task, we build a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations, including categories and relations, obtained from rule-based extractors and human annotators. Moreover, we propose an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with a soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
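To make the notion of hierarchical reconstruction concrete, the following is a minimal sketch of how line-level units, each carrying a category and a relation to a parent unit, could be assembled into a document tree. The `Unit` schema, its field names, and the category labels are assumptions made for illustration; they are not the actual HRDoc annotation format, and the sketch does not represent the DSPS model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Unit:
    """One line-level semantic unit (hypothetical schema, not HRDoc's format)."""
    text: str
    category: str            # assumed labels such as "title", "section", "paragraph"
    parent: Optional[int]    # index of the parent unit, or None for a top-level unit
    children: List["Unit"] = field(default_factory=list)


def build_tree(units: List[Unit]) -> List[Unit]:
    """Attach each unit to its parent and return the top-level units."""
    roots = []
    for unit in units:
        if unit.parent is None:
            roots.append(unit)
        else:
            units[unit.parent].children.append(unit)
    return roots


def outline(nodes: List[Unit], depth: int = 0) -> List[str]:
    """Render the tree as indented lines, one per unit, in reading order."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + f"[{node.category}] {node.text}")
        lines.extend(outline(node.children, depth + 1))
    return lines


if __name__ == "__main__":
    # Toy input: a flat list of units in reading order with parent indices.
    units = [
        Unit("A Sample Paper", "title", None),
        Unit("1 Introduction", "section", 0),
        Unit("Documents are hierarchical.", "paragraph", 1),
        Unit("2 Method", "section", 0),
    ]
    print("\n".join(outline(build_tree(units))))
```

In this sketch the relation is reduced to a single parent index per unit, so the flat sequence of lines, possibly spanning several pages, maps onto one tree per document.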