Pre-training techniques have proven successful across a variety of NLP tasks in recent years. Despite the widespread use of pre-trained models in NLP applications, they focus almost exclusively on text-level manipulation and neglect the layout and style information that is vital for document image understanding. In this paper, we propose LayoutLM to jointly model the interactions between text and layout information across scanned document images, which benefits a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we leverage image features to incorporate the visual information of words into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. LayoutLM achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available at https://github.com/microsoft/unilm/tree/master/layoutlm.