Unsupervised pre-training on millions of digital-born or scanned documents has driven promising advances in visual document understanding (VDU). While existing solutions study various vision-language pre-training objectives, the document textline, an intrinsic granularity in VDU, has seldom been explored. Textlines are easily obtained from OCR engines, and the words within a textline are usually spatially and semantically correlated. In this paper, we propose Wukong-Reader, trained with new pre-training objectives that leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and the texts of document textlines. In addition, we design masked region modeling and textline-grid matching to enhance the visual and layout representations of textlines. Experiments show that Wukong-Reader achieves superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also gives Wukong-Reader promising localization ability.
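To make the idea of aligning textline regions with textline texts concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired embeddings. It is an illustration under stated assumptions, not the Wukong-Reader objective itself: the abstract does not specify the similarity function, the temperature, or how embeddings are produced, so the function names, the cosine similarity, and the default `temperature=0.07` are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss over N paired embeddings.

    Assumption (not from the paper): region_embs[i] and text_embs[i] come from
    the same textline and form the positive pair; every other combination in
    the batch is treated as a negative.
    """
    regions = [l2_normalize(v) for v in region_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every region and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    region_to_text = cross_entropy_diagonal(logits)
    text_to_region = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (region_to_text + text_to_region)


# Toy usage: two textlines with 2-d embeddings.
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = symmetric_contrastive_loss(regions, aligned_texts)
swapped_loss = symmetric_contrastive_loss(regions, swapped_texts)
```

In this toy setup the loss is lower when each region embedding matches its own textline's text embedding than when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.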