Table of contents (ToC) extraction aims to extract the headings at different levels of a document so that the outline of its content can be better understood; the result is widely useful for document understanding and information retrieval. Existing works often rely on hand-crafted features and predefined rule-based functions to detect headings and resolve the hierarchical relationships between them. Both benchmarks and research based on deep learning remain limited. Accordingly, in this paper, we first introduce a standard dataset, HierDoc, which includes image samples from 650 scientific-paper documents together with their content labels. We then propose a novel end-to-end model that uses a multimodal tree decoder (MTD) for ToC extraction, serving as a benchmark for HierDoc. The MTD model is composed of three main parts: an encoder, a classifier, and a decoder. The encoder fuses multimodal features (vision, text, and layout information) for each entity in the document. The classifier then recognizes and selects the heading entities. Finally, a tree-structured decoder parses the hierarchical relationships between the heading entities. To evaluate performance, we adopt both tree-edit-distance similarity (TEDS) and F1-Measure. Our MTD approach achieves an average TEDS of 87.2% and an average F1-Measure of 88.1% on the HierDoc test set. The code and dataset will be released at: https://github.com/Pengfei-Hu/MTD.
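To make the target output concrete, the following minimal Python sketch shows how a flat list of headings can be assembled into a ToC tree. It is not the MTD decoder: it assumes that each heading already carries a known depth level, whereas the model described above predicts the hierarchical relationships itself. The names `TocNode`, `build_toc`, and `render` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TocNode:
    """One heading in the outline (hypothetical structure for illustration)."""
    title: str
    level: int
    children: List["TocNode"] = field(default_factory=list)


def build_toc(headings: List[Tuple[str, int]]) -> TocNode:
    """Assemble (title, level) pairs, given in reading order, into a tree.

    Assumption: levels are already known and start at 1. A dummy root at
    level 0 holds the top-level headings.
    """
    root = TocNode(title="ROOT", level=0)
    stack = [root]  # path from the root to the most recent heading
    for title, level in headings:
        if level < 1:
            raise ValueError("heading levels must be >= 1")
        # Pop until the top of the stack is shallower than the new heading;
        # that node is the parent. The root (level 0) is never popped.
        while stack[-1].level >= level:
            stack.pop()
        node = TocNode(title=title, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node: TocNode, depth: int = 0) -> List[str]:
    """Return an indented outline, one line per heading (root omitted)."""
    lines: List[str] = []
    for child in node.children:
        lines.append("  " * depth + child.title)
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    sample = [
        ("1 Introduction", 1),
        ("2 Method", 1),
        ("2.1 Encoder", 2),
        ("2.2 Decoder", 2),
        ("3 Experiments", 1),
    ]
    print("\n".join(render(build_toc(sample))))
```

A tree of this shape is the kind of structure that a tree-level metric such as TEDS compares between prediction and ground truth, while F1-Measure scores the headings themselves.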