Recent multimodal Transformers have improved Visually Rich Document Understanding (VrDU) tasks by incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units such as phrases and salient visual regions such as prominent image regions. In this paper, we attach more importance to coarse-grained elements, which contain high-density information and consistent semantics that are valuable for document understanding. First, a document graph is proposed to model the complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multi-grained multimodal Transformer called mmLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers based on this graph. In mmLayout, coarse-grained information is aggregated from fine-grained elements and, after further processing, fused back into them for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method improves the performance of multimodal Transformers based on fine-grained elements and achieves better performance with fewer parameters. Qualitative analyses show that our method can capture consistent semantics in coarse-grained elements.
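The aggregate-then-fuse pattern described above can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: it assumes coarse-grained elements (e.g. phrases or salient regions) are given as index groups over fine-grained token embeddings, uses simple mean-pooling for aggregation, and residual addition for fusion; the hypothetical names `aggregate_coarse` and `fuse_back` are not from the paper.

```python
import numpy as np

def aggregate_coarse(fine_emb, groups):
    """Mean-pool fine-grained embeddings (tokens/patches) into one
    coarse-grained embedding per group (phrase / salient region).

    fine_emb: (num_fine, dim) array; groups: list of index lists.
    Returns: (num_groups, dim) array of coarse embeddings.
    """
    return np.stack([fine_emb[idx].mean(axis=0) for idx in groups])

def fuse_back(fine_emb, coarse_emb, groups):
    """Fuse coarse-grained information back into each member
    fine-grained embedding via a residual addition."""
    fused = fine_emb.copy()
    for g, idx in enumerate(groups):
        fused[idx] += coarse_emb[g]
    return fused

# Toy example: 4 fine-grained tokens grouped into 2 phrases.
fine = np.array([[1.0, 0.0],
                 [3.0, 0.0],
                 [0.0, 2.0],
                 [0.0, 4.0]])
groups = [[0, 1], [2, 3]]
coarse = aggregate_coarse(fine, groups)   # shape (2, 2)
fused = fuse_back(fine, coarse, groups)   # shape (4, 2)
```

In the actual model, the "further processing" between aggregation and fusion would be Transformer layers operating on the coarse-grained elements over the document graph, rather than the identity step shown here.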