Due to the complex layouts of documents, it is challenging to extract information from them. Most previous studies develop multimodal pre-trained models in a self-supervised way. In this paper, we focus on learning embeddings of word blocks that contain both text and layout information, and propose UTel, a language model with Unified TExt and Layout pre-training. Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks. Moreover, we replace the commonly used 1D position embedding with a 1D clipped relative position embedding. In this way, the joint training of Masked Layout-Language Modeling (MLLM) and the two newly proposed tasks enables semantic and spatial features to interact in a unified manner. Additionally, by removing the absolute 1D position embedding, UTel can process sequences of arbitrary length while maintaining competitive performance. Extensive experimental results show that UTel learns better joint representations and outperforms previous methods on various downstream tasks, despite requiring no image modality. Code is available at \url{https://github.com/taosong2019/UTel}.
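To make the position-embedding change concrete, below is a minimal sketch of how a 1D clipped relative position scheme can be indexed; the clipping distance `max_dist` and the bucket-shift convention are illustrative assumptions, not details taken from the paper.

```python
import torch

def clipped_relative_positions(seq_len: int, max_dist: int = 8) -> torch.Tensor:
    """Pairwise relative offsets j - i, clipped to [-max_dist, max_dist].

    Illustrative only: the actual clipping distance and lookup scheme used by
    UTel are not specified in this abstract.
    """
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]        # (seq_len, seq_len) relative offsets
    rel = rel.clamp(-max_dist, max_dist)     # distant tokens share a clipped bucket
    return rel + max_dist                    # shift to [0, 2*max_dist] for embedding lookup

# Each index selects a learned relative-position embedding (or attention bias),
# so the model depends only on relative order, which is why sequences of
# arbitrary length can be processed once the absolute 1D embedding is removed.
rel_ids = clipped_relative_positions(seq_len=6, max_dist=2)
```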