Information extraction from document images has recently attracted considerable attention, driven by the need to digitize large volumes of unstructured documents such as invoices, receipts, and bank transfers. In this paper, we propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of a document, namely the \textit{Multi-Stage Attentional U-Net}. To effectively capture the textual and spatial relations between 2D elements, our model leverages a specialized multi-stage encoder-decoder design in conjunction with efficient use of the self-attention mechanism and box convolution. Experimental results on several datasets show that our model outperforms the baseline U-Net architecture by a large margin while using 40\% fewer parameters. Moreover, it also significantly improves on the baseline under erroneous OCR and limited training data scenarios, making it practical for real-world applications.
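For concreteness, the 2D character-grid input representation referred to above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual preprocessing: the `ocr_words` box format, the `downsample` factor, and the `vocab` mapping are all assumptions made for the example.

```python
import numpy as np

def build_chargrid(ocr_words, img_h, img_w, downsample=8, vocab=None):
    """Rasterize OCR output into a downsampled 2D grid of character indices.

    ocr_words: hypothetical list of (text, x0, y0, x1, y1) word boxes in pixels.
    Returns an (H', W') integer grid; 0 denotes background.
    """
    vocab = vocab or {c: i + 1 for i, c in enumerate(
        "abcdefghijklmnopqrstuvwxyz0123456789.,:/-")}
    grid = np.zeros((img_h // downsample, img_w // downsample), dtype=np.int64)
    for text, x0, y0, x1, y1 in ocr_words:
        if not text:
            continue
        char_w = (x1 - x0) / len(text)  # split the word box evenly per character
        for i, ch in enumerate(text.lower()):
            cx = int((x0 + (i + 0.5) * char_w) // downsample)
            cy0 = y0 // downsample
            cy1 = max(y1 // downsample, cy0 + 1)
            grid[cy0:cy1, cx:cx + 1] = vocab.get(ch, 0)
    return grid  # one-hot or embed this grid before feeding the U-Net

# Example: a single word "total" near the top-left of a 640x480 page.
g = build_chargrid([("total", 40, 32, 120, 48)], img_h=480, img_w=640)
```

The resulting grid preserves the page layout, so a U-Net-style encoder-decoder can reason jointly over textual content (the character indices) and spatial arrangement.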