Relevant information in documents is often summarized in tables, helping the reader to identify useful facts. Most benchmark datasets support either document layout analysis or table understanding, but they do not provide data for addressing both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), which aims to extract tables and define their structure while considering the textual context of the document. The accompanying dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables. The data are gathered from PubMed Central by merging the annotations provided in the PubTables-1M and PubLayNet datasets. The dataset supports CTE and adds new classes to those of the original datasets. The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition, and functional analysis. We formally define CTE and its evaluation metrics, show which subtasks can be tackled, and describe the advantages, limitations, and future work for this collection of data. Annotations and code will be accessible at https://github.com/AILab-UniFI/cte-dataset.