We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which makes it difficult to generalize to real-world applications. With TableBank, which contains 417K high-quality labeled tables, we build several strong baselines using state-of-the-art deep neural network models. We make TableBank publicly available and hope it will enable more deep learning approaches to the table detection and recognition task. The dataset and models are available at \url{https://github.com/doc-analysis/TableBank}.