Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose \textbf{LayoutLMv3} to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at \url{https://aka.ms/layoutlmv3}.