End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net

Information extraction from document images has received a lot of attention recently, driven by the need to digitize large volumes of unstructured documents such as invoices, receipts, and bank transfers. In this paper, we propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document, namely the \textit{Multi-Stage Attentional U-Net}. To effectively capture the textual and spatial relations between 2D elements, our model uses a specialized multi-stage encoder-decoder design, combined with efficient use of the self-attention mechanism and the box convolution. Experimental results on different datasets show that our model outperforms the baseline U-Net architecture by a large margin while using 40\% fewer parameters. Moreover, it also significantly improves on the baseline when the OCR output is erroneous and when training data is limited, which makes it practical for real-world applications.
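The abstract does not spell out how the 2D character grid is built, so the following is only a minimal sketch of the general idea: each OCR character box is rasterized onto a coarse grid, and every covered cell stores an integer character index that an embedding layer could later map to a vector. The `ALPHABET` string, the `CharBox` layout, the `cell` size, and the function name `build_char_grid` are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical OCR output format: (character, x0, y0, x1, y1) in pixels,
# with (x1, y1) exclusive. This layout is an assumption for the sketch.
CharBox = Tuple[str, int, int, int, int]

# Assumed vocabulary. Index 0 is reserved for background, 1 for unknown.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz.,:/-$"


def char_code(ch: str) -> int:
    """Map a character to an integer index (0 = background, 1 = unknown)."""
    pos = ALPHABET.find(ch.lower())
    return pos + 2 if pos >= 0 else 1


def build_char_grid(boxes: List[CharBox], width: int, height: int,
                    cell: int = 8) -> List[List[int]]:
    """Rasterize character boxes into a 2D grid of character indices."""
    rows = (height + cell - 1) // cell
    cols = (width + cell - 1) // cell
    grid = [[0] * cols for _ in range(rows)]
    for ch, x0, y0, x1, y1 in boxes:
        code = char_code(ch)
        # Fill every grid cell that the character box overlaps.
        for r in range(y0 // cell, min(rows, (y1 - 1) // cell + 1)):
            for c in range(x0 // cell, min(cols, (x1 - 1) // cell + 1)):
                grid[r][c] = code
    return grid


# Two characters side by side on a 24x8 pixel page, one empty cell at the end.
example = build_char_grid([("A", 0, 0, 8, 8), ("1", 8, 0, 16, 8)],
                          width=24, height=8)
```

In a setup like this, the integer grid would typically be passed through a learned embedding lookup before entering a segmentation-style network such as a U-Net; how the paper's model does this exactly is not stated in the abstract.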

Information Extraction

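Information extraction (IE) turns the information contained in text into a structured, table-like form. The input to an IE system is raw text, and the output is a set of information points in a fixed format. These points are extracted from many different documents and then integrated in a uniform form, which is the main task of information extraction. The benefit of a uniform form is that the information becomes easy to inspect and compare. IE does not try to understand a whole document; it only analyzes the parts of the document that contain relevant information. Which information counts as relevant is determined by the domain scope fixed when the system is designed.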
