Pre-training techniques have proven successful across a variety of NLP tasks in recent years. Although pre-trained models are now widely used in NLP applications, they focus almost exclusively on text-level manipulation and neglect the layout and style information that is vital for document image understanding. In this paper, we propose LayoutLM to jointly model the interactions between text and layout information across scanned document images, which benefits a large number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we leverage image features to incorporate the visual information of words into LayoutLM. To the best of our knowledge, this is the first time that text and layout have been jointly learned in a single framework for document-level pre-training. LayoutLM achieves new state-of-the-art results in several downstream tasks, including receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models will be available soon at https://github.com/microsoft/unilm/tree/master/layoutlm.
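The abstract does not spell out how text and layout are combined, so the following is only a minimal sketch of one plausible reading: each word's token embedding is summed with embeddings of its bounding-box coordinates, so that the same word at two positions on the page gets two different input vectors. The additive combination, the 0 to 1000 coordinate grid, the table sizes, and all function names (`make_table`, `normalize_box`, `embed_word`) are illustrative assumptions, not details taken from the text above.

```python
import random


def make_table(rows, dim, rng):
    """Create a randomly initialised embedding table with `rows` vectors of size `dim`."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def normalize_box(box, page_width, page_height, scale=1000):
    """Map a pixel box (x0, y0, x1, y1) onto an integer grid from 0 to `scale` (assumed)."""
    x0, y0, x1, y1 = box
    return (
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    )


def embed_word(token_id, box, token_table, x_table, y_table):
    """Sum the token embedding with the embeddings of the four box coordinates."""
    x0, y0, x1, y1 = box
    parts = [
        token_table[token_id],  # what the word is
        x_table[x0],            # left edge
        y_table[y0],            # top edge
        x_table[x1],            # right edge
        y_table[y1],            # bottom edge
    ]
    return [sum(values) for values in zip(*parts)]


# Hypothetical sizes: 100-token vocabulary, 8-dimensional embeddings.
rng = random.Random(0)
DIM = 8
token_table = make_table(100, DIM, rng)
x_table = make_table(1001, DIM, rng)  # one row per grid position 0..1000
y_table = make_table(1001, DIM, rng)

# A word occupying pixels (50, 100)-(150, 120) on a 500 x 1000 page.
box = normalize_box((50, 100, 150, 120), page_width=500, page_height=1000)
vector = embed_word(7, box, token_table, x_table, y_table)
```

In this sketch the layout signal enters only through the input representation, so any sequence encoder downstream would receive position-aware word vectors without further changes.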