Multimodal pre-training with text, layout, and image has recently achieved SOTA performance on visually-rich document understanding tasks, demonstrating the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), with manually labeled key-value pairs for each language. Experimental results show that the LayoutXLM model significantly outperforms the existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset will be publicly available at https://aka.ms/layoutxlm.