Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened promising directions for reducing annotation effort. However, most existing document pretraining methods are still language-dominated. We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. Each input element is composed of the words and visual features from a semantic region of the input document image. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses, encouraging the representation to model sentences, learn similarities, and align modalities. Extensive empirical analysis demonstrates that the pretraining procedure learns better joint representations and leads to improvements in downstream tasks.
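To make the described input construction concrete, the following is a minimal PyTorch sketch of how word embeddings and visual region features might be fused into a single sequence of Transformer inputs, with a weighted combination of three pretraining losses. All module names, dimensions, and loss weights here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalEmbedding(nn.Module):
    """Hypothetical sketch: each input element pairs the words of a semantic
    region with that region's visual features, as the abstract describes."""
    def __init__(self, vocab_size=30522, text_dim=768, visual_dim=2048, hidden_dim=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)      # map words into joint space
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # map region features into joint space

    def forward(self, token_ids, region_features):
        # token_ids: (batch, seq_len); region_features: (batch, seq_len, visual_dim),
        # where each token is associated with the visual features of its region.
        text = self.text_proj(self.word_emb(token_ids))
        visual = self.visual_proj(region_features)
        return text + visual  # fused multimodal embedding fed to the Transformer

def pretraining_loss(l_sentence, l_similarity, l_alignment, w=(1.0, 1.0, 1.0)):
    # Illustrative combination of the three self-supervised objectives
    # (sentence modeling, similarity learning, modality alignment);
    # the weights are assumptions, not values from the paper.
    return w[0] * l_sentence + w[1] * l_similarity + w[2] * l_alignment
```

In this sketch, summing the projected text and visual features is one simple fusion choice; concatenation followed by a projection would be an equally plausible alternative under the abstract's description.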