Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks, owing to effective model architectures and the availability of large-scale unlabeled scanned and digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, using a two-stream multi-modal Transformer encoder, LayoutLMv2 is pre-trained not only with the existing masked visual-language modeling task but also with new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction during pre-training. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can capture the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). We have made our model and code publicly available at \url{https://aka.ms/layoutlmv2}.
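The abstract names a spatial-aware self-attention mechanism but does not say how it is computed. One plausible reading, sketched below purely as an illustration, is that bias terms indexed by relative offsets (sequence order, horizontal distance, vertical distance) are added to the raw attention scores before the softmax. The function name `spatial_aware_attention`, the `(x, y)` box format, and the dictionary-based bias tables are assumptions made for this sketch. A trained model would learn the bias values, and would likely bucket the offsets, rather than look them up in hand-written dictionaries.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_aware_attention(scores, boxes, bias_1d, bias_x, bias_y):
    """Add relative-position biases to raw attention scores, then normalize.

    scores  : n x n list of raw (already scaled) query-key scores
    boxes   : one (x, y) coordinate per token (hypothetical input format)
    bias_1d : dict mapping the sequence offset j - i to a bias
    bias_x  : dict mapping the horizontal offset x_j - x_i to a bias
    bias_y  : dict mapping the vertical offset y_j - y_i to a bias
    Offsets missing from a table contribute a bias of 0.0.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            dx = boxes[j][0] - boxes[i][0]
            dy = boxes[j][1] - boxes[i][1]
            bias = (
                bias_1d.get(j - i, 0.0)
                + bias_x.get(dx, 0.0)
                + bias_y.get(dy, 0.0)
            )
            row.append(scores[i][j] + bias)
        # Each row becomes a probability distribution over the keys.
        weights.append(softmax(row))
    return weights


# Toy input: two tokens on the same line, 10 units apart horizontally.
scores = [[0.0, 0.0], [0.0, 0.0]]
boxes = [(0, 0), (10, 0)]
# Assumed bias: favor a key that lies 10 units to the right of the query.
weights = spatial_aware_attention(scores, boxes, {}, {10: 1.0}, {})
```

In this toy input the first token attends more strongly to the second, because the only nonzero bias applies to a horizontal offset of +10, while the second token's attention stays uniform. The content-based scores are identical in both rows, so the difference comes entirely from layout.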