The surge of pre-training has recently driven rapid progress in document understanding. The pre-training and fine-tuning framework has been used effectively to tackle texts in various formats, including plain text, document text, and web text. Despite achieving promising performance, existing pre-trained models usually target one specific document format at a time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model that handles different document formats in a single model. For parameter efficiency, we share backbone parameters across different formats, such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results demonstrate that with only 36.7\% of the parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with individual pre-trained models, which is cost-effective for real-world deployment. The code and pre-trained models will be publicly available at \url{https://aka.ms/xdoc}.
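To make the parameter-sharing idea concrete, the following is a minimal PyTorch sketch of a shared backbone (word embeddings plus Transformer layers) combined with lightweight per-format adaptive layers. All module names, sizes, and the choice of a single linear adapter per format are illustrative assumptions, not the paper's exact architecture.

\begin{verbatim}
import torch
import torch.nn as nn

class SharedBackboneWithAdapters(nn.Module):
    """Illustrative sketch: shared backbone + per-format adaptive layers.

    Hypothetical configuration; sizes and adapter design are assumptions.
    """

    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12,
                 formats=("plain", "document", "web")):
        super().__init__()
        # Shared parameters: word embedding layer and Transformer encoder.
        self.embed = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # Lightweight adaptive layers: one small projection per document format.
        self.adapters = nn.ModuleDict(
            {f: nn.Linear(hidden, hidden) for f in formats})

    def forward(self, input_ids, doc_format="plain"):
        x = self.embed(input_ids)
        # Format-specific adaptation before the shared encoder.
        x = self.adapters[doc_format](x)
        return self.backbone(x)

# Usage: the same backbone serves all formats; only the adapter differs.
model = SharedBackboneWithAdapters()
ids = torch.randint(0, 30522, (2, 16))
out = model(ids, doc_format="web")  # shape: (2, 16, 768)
\end{verbatim}

Because the adapters are small relative to the shared embedding and Transformer layers, the total parameter count stays close to that of a single model, which is the source of the parameter savings reported above.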