We present MATrIX, a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs, presentations, and advertisements. In these documents, text semantics and visual information complement each other to provide a global understanding of the document. MATrIX is pre-trained in an unsupervised way on specifically designed tasks that require the use of multi-modal information (spatial, visual, or textual). We consider the spatial and text modalities jointly in a single token set. To make the attention more flexible, we use a learned modality-aware relative bias in the attention mechanism to modulate the attention between tokens of different modalities. We evaluate MATrIX on three different datasets, each with strong baselines.
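The modality-aware bias described above can be illustrated with a minimal sketch. It assumes the simplest possible reading: one additive scalar per (query modality, key modality) pair, added to scaled dot-product scores before the softmax. The abstract does not give the exact formulation, so the function name `modality_biased_attention`, the modality ids, and the bias values are all hypothetical; in a real model the bias table would be learned, and it might also depend on relative position.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modality_biased_attention(queries, keys, values, modalities, bias):
    """Single-head attention with an additive bias per modality pair.

    bias[a][b] is added to the score whenever a query token of modality a
    attends to a key token of modality b (an assumed, simplified form).
    """
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            + bias[modalities[i]][modalities[j]]
            for j, k in enumerate(keys)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors, one output channel at a time.
        outputs.append([
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs, weights


# Toy single token set mixing two text tokens and two spatial tokens.
TEXT, SPATIAL = 0, 1  # hypothetical modality ids
modalities = [TEXT, TEXT, SPATIAL, SPATIAL]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# Hypothetical bias table; these values would be learned parameters.
bias = [[0.0, -1.0],   # text query    -> text key, spatial key
        [0.5, 0.0]]    # spatial query -> text key, spatial key

outputs, weights = modality_biased_attention(tokens, tokens, tokens, modalities, bias)
```

With this table, text queries attend less to spatial keys than they would without the bias, while spatial queries attend more to text keys. This asymmetric modulation between modalities is the kind of flexibility the bias is meant to provide.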