In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. The proposed method randomly masks image regions according to the bounding box coordinates of text words. The pre-training objectives are to reconstruct the pixels of the masked image regions and to predict the corresponding masked tokens simultaneously. As a result, the pre-trained encoder captures more textual semantics than conventional masked image modeling, which usually predicts masked image patches. Unlike masked multi-modal modeling methods for document image understanding, which rely on both the image and text modalities, StrucTexTv2 takes image-only input and can potentially serve more application scenarios, since it requires no OCR pre-processing. Extensive experiments on mainstream document image understanding benchmarks demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance on various downstream tasks, such as image classification, layout analysis, table structure recognition, document OCR, and information extraction in the end-to-end setting.
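To make the text region-level masking step concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes a grayscale image stored as a list of rows and word boxes given as `(x0, y0, x1, y1)` with exclusive upper bounds. The function name `mask_text_regions`, the `mask_ratio` default, and the constant `fill` value are illustrative assumptions; the abstract does not specify them.

```python
import random


def mask_text_regions(image, word_boxes, mask_ratio=0.3, fill=0, seed=None):
    """Mask a random subset of word boxes in a grayscale image.

    image: list of rows, each a list of pixel values (H x W).
    word_boxes: list of (x0, y0, x1, y1), with x1 and y1 exclusive.
    Returns (masked_image, mask, chosen): mask[y][x] is 1 for masked
    pixels, and chosen holds the indices of the masked boxes.
    mask_ratio and fill are hypothetical defaults, not values from the paper.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])

    # Number of boxes to mask: at least one if any box exists, never more
    # than the number of boxes available.
    if word_boxes:
        n_mask = min(len(word_boxes), max(1, round(mask_ratio * len(word_boxes))))
    else:
        n_mask = 0
    chosen = sorted(rng.sample(range(len(word_boxes)), n_mask))

    masked = [row[:] for row in image]  # copy so the input stays untouched
    mask = [[0] * width for _ in range(height)]
    for i in chosen:
        x0, y0, x1, y1 = word_boxes[i]
        # Clip the box to the image bounds before filling.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                masked[y][x] = fill
                mask[y][x] = 1
    return masked, mask, chosen


if __name__ == "__main__":
    page = [[255] * 16 for _ in range(8)]
    boxes = [(0, 0, 4, 2), (5, 0, 9, 2), (0, 4, 6, 6), (8, 4, 14, 7)]
    _, pixel_mask, picked = mask_text_regions(page, boxes, mask_ratio=0.5, seed=0)
    print("masked word indices:", picked)
    print("masked pixel count:", sum(map(sum, pixel_mask)))
```

In a pre-training setup of the kind described above, the returned `mask` would mark the pixels to reconstruct, and `chosen` would identify the words whose tokens are to be predicted. How those two targets are actually derived and weighted is not stated in the abstract.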