Table extraction is an important but still unsolved problem. In this paper, we introduce a flexible and modular table extraction system. We develop two rule-based algorithms that perform the complete table recognition process, including table detection and segmentation, and support the most frequent table formats. Moreover, to incorporate the extraction of semantic information, we develop a graph-based table interpretation method. We conduct extensive experiments on the challenging table recognition benchmarks ICDAR 2013 and ICDAR 2019, achieving results competitive with state-of-the-art approaches. Our complete information extraction system achieves an F1 score of 0.7380. To support future research on information extraction from documents, we make the resources from our table interpretation experiment (ground-truth annotations, evaluation scripts, algorithm parameters) publicly available.