We propose Universal Document Processing (UDOP), a foundation Document AI model that unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and the document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on large-scale unlabeled document corpora using innovative self-supervised objectives, as well as on diverse labeled data. UDOP also learns to generate document images from the text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of Document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state of the art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains such as finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark (DUE).