Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data efficiency, and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, which is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types, and that learning good representations is critical to accomplishing this.
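The transfer-learning recipe the abstract describes can be sketched as follows: pretrain a model on a large labeled source corpus, then fine-tune it on the small (~50 document) target corpus, rather than training on the target corpus alone. This is a minimal illustrative sketch only; the toy perceptron and synthetic "documents" below stand in for the paper's actual model and data, and the domain `shift` parameter merely mimics a structural gap between source and target domains.

```python
import random


def train(weights, data, epochs, lr=0.1):
    """Simple perceptron updates; `weights` is mutated in place and returned."""
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
            if pred != y:
                for i, xi in enumerate(x):
                    weights[i] += lr * (y - pred) * xi
    return weights


def accuracy(weights, data):
    """Fraction of examples the linear model classifies correctly."""
    correct = sum(
        (1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0) == y
        for x, y in data
    )
    return correct / len(data)


def make_data(n, seed, shift=0.0):
    """Linearly separable toy 'documents'; `shift` mimics a domain gap.

    Each example is [feature1, feature2, bias], with a label that depends
    on the (domain-specific) shift.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.uniform(-1, 1) + shift
        v = rng.uniform(-1, 1)
        out.append(([u, v, 1.0], 1 if u + v > shift else 0))
    return out


source = make_data(1000, seed=0, shift=0.0)  # large labeled source corpus
target = make_data(50, seed=1, shift=0.3)    # small target corpus (~50 docs)

# Baseline: train from scratch on the small target corpus only.
scratch = train([0.0, 0.0, 0.0], target, epochs=5)

# Transfer: pretrain on the source corpus, then fine-tune on the target.
transferred = train([0.0, 0.0, 0.0], source, epochs=5)
transferred = train(transferred, target, epochs=5)
```

The design point the sketch captures is that the fine-tuned model starts from source-domain weights instead of zeros, so the small target corpus only has to correct the domain gap rather than teach the task from scratch.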