A significant portion of the data available today is found within tables. Therefore, automated table extraction is necessary to obtain thorough results when data-mining. Today's popular state-of-the-art methods for table extraction struggle to adequately extract tables with machine-readable text and structural data. To make matters worse, many tables do not contain machine-readable data at all, such as tables saved as images, making most extraction methods completely ineffective. To address these issues, a novel, general-format table extractor tool, Tablext, is proposed. This tool uses a combination of computer vision techniques and machine learning methods to efficiently and effectively identify and extract data from tables. Tablext begins by using a custom Convolutional Neural Network (CNN) to identify and separate all potential tables. The identification process is optimized by combining the custom CNN with the YOLO object detection network. Then, the high-level structure of each table is identified with computer vision methods. This high-level structural metadata is used by another CNN to identify exact cell locations. As a final step, Optical Character Recognition (OCR) is performed on every individual cell to extract its content without needing machine-readable text. This multi-stage algorithm allows the neural networks to focus on complex tasks while letting image processing methods efficiently complete the simpler ones. As a result, the proposed approach is general-purpose enough to handle large batches of tables regardless of their internal encodings or their layout complexity, and accurate enough to outperform competing state-of-the-art table extractors on the ICDAR 2013 table dataset.
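The multi-stage pipeline described above can be sketched in outline. All function names and data shapes below are illustrative assumptions (the paper does not publish this interface); the neural-network and OCR stages are stubbed with placeholders so only the flow of structural metadata between stages is shown.

```python
# Hypothetical sketch of the Tablext pipeline stages; names and
# return shapes are assumptions, not the authors' actual API.

def detect_tables(page_image):
    """Stage 1: a custom CNN combined with YOLO would return table
    bounding boxes. Stubbed with one fixed (x, y, w, h) box."""
    return [(0, 0, 100, 40)]

def find_structure(table_box):
    """Stage 2: classical computer vision (e.g. ruling-line and
    whitespace detection) recovers high-level row/column separators."""
    return {"rows": [0, 20, 40], "cols": [0, 50, 100]}

def locate_cells(structure):
    """Stage 3: a second CNN would refine exact cell locations from the
    structural metadata; here the grid is formed from the separators."""
    rows, cols = structure["rows"], structure["cols"]
    return [(cols[j], rows[i], cols[j + 1], rows[i + 1])
            for i in range(len(rows) - 1)
            for j in range(len(cols) - 1)]

def ocr_cell(cell_box):
    """Stage 4: OCR on each cell image, so no machine-readable text
    is required in the source document. Stubbed with a placeholder."""
    return f"text@{cell_box}"

def extract(page_image):
    """Run the four stages per detected table; each table becomes a
    flat list of per-cell strings."""
    results = []
    for box in detect_tables(page_image):
        structure = find_structure(box)
        cells = locate_cells(structure)
        results.append([ocr_cell(c) for c in cells])
    return results

tables = extract(None)
print(len(tables), len(tables[0]))  # 1 table with a 2x2 grid of cells
```

The point of the staged design, as the abstract argues, is that the expensive learned components (stages 1 and 3) only handle the hard localization problems, while cheap image processing handles structure recovery.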