Document intelligence, a relatively new research topic, underpins many business applications. Its main task is to read, understand, and analyze documents automatically. However, the diversity of document formats (invoices, reports, forms, etc.) and layouts makes it difficult for machines to understand documents. In this paper, we present GraphDoc, a multimodal graph attention-based model for various document understanding tasks. GraphDoc is pre-trained in a multimodal framework that exploits text, layout, and image information simultaneously. Because a text block in a document relies heavily on its surrounding context, we inject the graph structure into the attention mechanism to form a graph attention layer, so that each input node attends only to its neighborhoods. The input nodes of each graph attention layer are composed of textual, visual, and positional features from semantically meaningful regions of a document image. A gate fusion layer fuses the multimodal features of each node, and the graph attention layer models the contextualization between nodes. GraphDoc learns a generic representation from only 320k unlabeled documents via the Masked Sentence Modeling task. Extensive experiments on publicly available datasets show that GraphDoc achieves state-of-the-art performance, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/ZZR8066/GraphDoc.
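To make the two core components concrete, the following is a minimal PyTorch sketch of (a) a gate fusion layer that blends the textual and visual features of each node, and (b) a graph attention layer implemented as standard multi-head self-attention restricted by an adjacency mask, so each node attends only to its graph neighbors. All class and variable names are illustrative assumptions for exposition, not the authors' implementation; for the exact architecture see the repository linked above.

```python
import torch
import torch.nn as nn


class GateFusion(nn.Module):
    """Fuses textual and visual node features with a learned per-channel gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per channel, how much of each modality to keep.
        g = torch.sigmoid(self.gate(torch.cat([text_feat, visual_feat], dim=-1)))
        return g * text_feat + (1.0 - g) * visual_feat


class GraphAttentionLayer(nn.Module):
    """Self-attention restricted to graph neighborhoods via an adjacency mask."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, nodes: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # adjacency: (batch, n, n) boolean, True where node i may attend to node j
        # (the diagonal should be True so every node can attend to itself).
        # nn.MultiheadAttention blocks positions where attn_mask is True,
        # so we pass the complement of the adjacency matrix, repeated per head.
        attn_mask = (~adjacency).repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(nodes, nodes, nodes, attn_mask=attn_mask)
        return out
```

A toy usage, assuming positional information has already been added into the node features and the adjacency matrix encodes spatial neighborhoods (e.g. k-nearest text blocks on the page):

```python
dim = 256
fusion = GateFusion(dim)
layer = GraphAttentionLayer(dim)

text = torch.randn(1, 5, dim)                 # features of 5 text blocks
visual = torch.randn(1, 5, dim)               # visual features of the same regions
adj = torch.ones(1, 5, 5, dtype=torch.bool)   # fully connected toy graph

out = layer(fusion(text, visual), adj)        # contextualized node representations
```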