Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present \textbf{LayoutLMv2}, which pre-trains text, layout, and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD ($0.7895 \rightarrow 0.8420$), CORD ($0.9493 \rightarrow 0.9601$), SROIE ($0.9524 \rightarrow 0.9781$), Kleister-NDA ($0.834 \rightarrow 0.852$), RVL-CDIP ($0.9443 \rightarrow 0.9564$), and DocVQA ($0.7295 \rightarrow 0.8672$). The pre-trained LayoutLMv2 model is publicly available at \url{https://aka.ms/layoutlmv2}.
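To make the spatial-aware self-attention mechanism concrete, the following is a minimal sketch of how relative 1-D (token index) and 2-D (bounding-box coordinate) position biases can be added to standard scaled dot-product attention. The class name, bucket sizes, use of the boxes' top-left corners, and the simple clamping of relative distances are illustrative assumptions for exposition, not the released implementation.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareSelfAttention(nn.Module):
    """Sketch: dot-product attention plus learnable relative
    1-D (token index) and 2-D (x/y coordinate) position biases."""
    def __init__(self, hidden, num_heads, max_rel_1d=128, max_rel_2d=256):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, hidden // num_heads
        self.qkv = nn.Linear(hidden, 3 * hidden)
        self.out = nn.Linear(hidden, hidden)
        # one learnable scalar per head per clamped relative distance
        self.rel_1d = nn.Embedding(2 * max_rel_1d + 1, num_heads)
        self.rel_x = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.rel_y = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.max_rel_1d, self.max_rel_2d = max_rel_1d, max_rel_2d

    def forward(self, hidden_states, boxes):
        # hidden_states: (B, L, H); boxes: (B, L, 4) with (x0, y0, x1, y1)
        B, L, H = hidden_states.shape
        q, k, v = self.qkv(hidden_states).chunk(3, dim=-1)
        q = q.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-1, -2) / self.head_dim ** 0.5  # (B, h, L, L)

        # relative token-index bias
        idx = torch.arange(L, device=hidden_states.device)
        rel = (idx[None, :] - idx[:, None]).clamp(
            -self.max_rel_1d, self.max_rel_1d) + self.max_rel_1d
        scores = scores + self.rel_1d(rel).permute(2, 0, 1)[None]

        # relative 2-D bias from the boxes' top-left corners
        x, y = boxes[..., 0], boxes[..., 1]
        rel_x = ((x[:, None, :] - x[:, :, None]).clamp(
            -self.max_rel_2d, self.max_rel_2d) + self.max_rel_2d).long()
        rel_y = ((y[:, None, :] - y[:, :, None]).clamp(
            -self.max_rel_2d, self.max_rel_2d) + self.max_rel_2d).long()
        scores = scores + self.rel_x(rel_x).permute(0, 3, 1, 2) \
                        + self.rel_y(rel_y).permute(0, 3, 1, 2)

        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(B, L, H)
        return self.out(ctx)
\end{verbatim}

The linear clamping of relative distances above is a simplification; the key point is that the layout-dependent bias terms are added to the attention scores before the softmax, so every head can condition on how far apart two text blocks are on the page.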