Pre-training techniques have proven successful in a variety of NLP tasks in recent years. Despite their widespread use in NLP applications, pre-trained models almost exclusively focus on text-level manipulation, neglecting the layout and style information that is vital for document image understanding. In this paper, we propose \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images, which benefits a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. LayoutLM achieves new state-of-the-art results on several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available at \url{https://aka.ms/layoutlm}.
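As a minimal sketch (not the released implementation) of the joint text-and-layout modeling the abstract describes, the core idea can be illustrated as summing each token's word embedding with 2-D position embeddings looked up from its bounding-box coordinates. All table sizes, the embedding dimension, and the use of NumPy lookup tables here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary, coordinate range (normalized to 0..1000), embedding dim.
VOCAB, COORD, DIM = 1000, 1001, 64

# Lookup tables: one for tokens, shared x/y tables for bounding-box coordinates.
tok_emb = rng.normal(size=(VOCAB, DIM))
x_emb = rng.normal(size=(COORD, DIM))
y_emb = rng.normal(size=(COORD, DIM))

def layout_embedding(token_ids, boxes):
    """Sum token embeddings with 2-D position embeddings of each word's
    bounding box (x0, y0, x1, y1), giving one layout-aware vector per word."""
    out = tok_emb[token_ids]
    # Each coordinate indexes into the x- or y-coordinate embedding table.
    for i, table in enumerate([x_emb, y_emb, x_emb, y_emb]):
        out = out + table[boxes[:, i]]
    return out

# Two words with normalized bounding boxes (x0, y0, x1, y1).
ids = np.array([12, 345])
boxes = np.array([[48, 84, 160, 100], [170, 84, 300, 100]])
emb = layout_embedding(ids, boxes)  # shape (2, 64)
```

The resulting layout-aware vectors would then feed a Transformer encoder, so that attention can exploit spatial relationships between words on the page in addition to their textual content.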