We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document, and it models the contextualization between each block of content. Unlike existing document pre-training models, our model is coarse-grained: instead of treating individual words as input, it operates on semantic components, thereby avoiding overly fine-grained modeling and excessive contextualization. Beyond that, we introduce cross-modal learning in the model pre-training phase to fully leverage multimodal information from unlabeled documents. For downstream usage, we propose a novel modality-adaptive attention mechanism for multimodal feature fusion that adaptively emphasizes language and vision signals. Our framework benefits from self-supervised pre-training on documents through a feature masking training strategy, without requiring annotations. It achieves superior performance on multiple downstream tasks while using significantly fewer document images in the pre-training stage than previous works.
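To make the idea of "adaptively emphasizing language and vision signals" concrete, the sketch below shows one minimal way such a modality-adaptive fusion could look. This is an illustrative assumption, not the paper's actual implementation: the module name `ModalityAdaptiveFusion`, the per-block scalar gating, and all tensor shapes are hypothetical choices for exposition.

```python
# Illustrative sketch only: a minimal modality-adaptive fusion layer in the
# spirit of adaptively weighting language vs. vision features per document block.
# All names, shapes, and the gating formulation are assumptions for illustration.
import torch
import torch.nn as nn


class ModalityAdaptiveFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Predict two modality weights per block from the concatenated features.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, lang: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # lang, vis: (batch, num_blocks, dim) features for the same document blocks.
        weights = torch.softmax(self.gate(torch.cat([lang, vis], dim=-1)), dim=-1)
        w_lang, w_vis = weights[..., :1], weights[..., 1:]
        # Weighted sum emphasizes whichever modality the gate favors for each block.
        return w_lang * lang + w_vis * vis


if __name__ == "__main__":
    fusion = ModalityAdaptiveFusion(dim=768)
    lang = torch.randn(2, 50, 768)   # e.g. text features for 50 document blocks
    vis = torch.randn(2, 50, 768)    # matching visual features for the same blocks
    fused = fusion(lang, vis)
    print(fused.shape)               # torch.Size([2, 50, 768])
```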