Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially for fixed-layout documents such as scanned document images. However, there is still a large number of digital documents whose layout information is not fixed and must be rendered interactively and dynamically for visualization, making existing layout-based pre-training approaches difficult to apply. In this paper, we propose MarkupLM for document understanding tasks on documents whose backbone is a markup language, such as HTML/XML-based documents, where text and markup information are jointly pre-trained. Experimental results show that the pre-trained MarkupLM significantly outperforms existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.