Recently, significant progress has been made in applying machine learning to the problem of table structure inference and extraction from unstructured documents. However, one of the greatest challenges remains the creation of datasets with complete, unambiguous ground truth at scale. To address this, we develop a new, more comprehensive dataset for table extraction, called PubTables-1M. PubTables-1M contains nearly one million tables from scientific articles, supports multiple input modalities, and contains detailed header and location information for table structures, making it useful for a wide variety of modeling approaches. It also addresses oversegmentation, a significant source of ground truth inconsistency observed in prior datasets, using a novel canonicalization procedure. We demonstrate that these improvements lead to a significant increase in training performance and a more reliable estimate of model performance at evaluation for table structure recognition. Further, we show that transformer-based object detection models trained on PubTables-1M produce excellent results for all three tasks of detection, structure recognition, and functional analysis without the need for any special customization for these tasks. Data and code will be released at https://github.com/microsoft/table-transformer.