Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data efficiency, and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, which is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types, and that learning good representations is critical to accomplishing this.
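The transfer-learning recipe the abstract describes can be sketched as follows: pretrain a model on a large labeled source corpus, then fine-tune it on the small (~50 document) target corpus, rather than training on the target corpus alone. This is a minimal illustrative sketch only; the toy perceptron and synthetic "documents" below stand in for the paper's actual model and data, and the domain `shift` parameter merely mimics a structural gap between source and target domains.

```python
import random


def train(weights, data, epochs, lr=0.1):
    """Simple perceptron updates; `weights` is mutated in place and returned."""
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
            if pred != y:
                for i, xi in enumerate(x):
                    weights[i] += lr * (y - pred) * xi
    return weights


def accuracy(weights, data):
    """Fraction of examples the linear model classifies correctly."""
    correct = sum(
        (1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0) == y
        for x, y in data
    )
    return correct / len(data)


def make_data(n, seed, shift=0.0):
    """Linearly separable toy 'documents'; `shift` mimics a domain gap.

    Each example is [feature1, feature2, bias], with a label that depends
    on the (domain-specific) shift.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.uniform(-1, 1) + shift
        v = rng.uniform(-1, 1)
        out.append(([u, v, 1.0], 1 if u + v > shift else 0))
    return out


source = make_data(1000, seed=0, shift=0.0)  # large labeled source corpus
target = make_data(50, seed=1, shift=0.3)    # small target corpus (~50 docs)

# Baseline: train from scratch on the small target corpus only.
scratch = train([0.0, 0.0, 0.0], target, epochs=5)

# Transfer: pretrain on the source corpus, then fine-tune on the target.
transferred = train([0.0, 0.0, 0.0], source, epochs=5)
transferred = train(transferred, target, epochs=5)
```

The design point the sketch captures is that the fine-tuned model starts from source-domain weights instead of zeros, so the small target corpus only has to correct the domain gap rather than teach the task from scratch.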