In this paper, we present a model pretraining technique, named MaskOCR, for text recognition. Our text recognition architecture is an encoder-decoder transformer: the encoder extracts patch-level representations, and the decoder recognizes the text from those representations. Our approach pretrains the encoder and the decoder sequentially. (i) We pretrain the encoder in a self-supervised manner over a large set of unlabeled real text images. We adopt the masked image modeling approach, which has proven effective for general images, so that the learned representations capture the semantics of the text. (ii) We pretrain the decoder over a large set of synthesized text images in a supervised manner, and enhance its language modeling capability by randomly masking some of the character-occupied patches in the text images fed to the encoder, and accordingly the corresponding representations fed to the decoder. Experiments show that the proposed MaskOCR approach achieves superior results on benchmark datasets covering both Chinese and English text images.
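To make the two-stage recipe concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names (Encoder, TextDecoder, stage1_mim_loss, stage2_decoder_loss) and all widths, depths, patch sizes, and masking ratios are illustrative assumptions. In particular, stage (ii) below masks random patch embeddings as a stand-in for detecting character-occupied patches, and whether the encoder is frozen or tuned during that stage is left open here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 4                          # assumed patch width (image height 32, so each patch is 32x4 pixels)
D, VOCAB, MAX_LEN = 256, 100, 25   # assumed model width, charset size, and max text length

class Encoder(nn.Module):
    """Patch embedding + transformer encoder producing patch-level representations."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, D, kernel_size=(32, PATCH), stride=(32, PATCH))
        self.pos = nn.Parameter(torch.zeros(1, 128 // PATCH, D))
        layer = nn.TransformerEncoderLayer(D, nhead=8, dim_feedforward=4 * D, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, x):                                             # x: (B, 3, 32, 128)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)       # (B, N, D)
        return self.blocks(tokens + self.pos)

def stage1_mim_loss(encoder, pixel_head, images, mask_ratio=0.7):
    """Stage (i): masked image modeling on unlabeled real text images.
    Masked patch embeddings are zeroed (a learnable mask token in practice) and
    the head regresses the pixels of the masked patches."""
    B = images.size(0)
    tokens = encoder.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, D)
    N = tokens.size(1)
    mask = torch.rand(B, N, device=images.device) < mask_ratio        # True = masked
    tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    feats = encoder.blocks(tokens + encoder.pos)
    pred = pixel_head(feats)                                          # (B, N, 32*PATCH*3)
    target = images.unfold(3, PATCH, PATCH)                           # (B, 3, 32, N, PATCH)
    target = target.permute(0, 3, 1, 2, 4).reshape(B, N, -1)
    return F.mse_loss(pred[mask], target[mask])

class TextDecoder(nn.Module):
    """Transformer decoder that attends to the patch representations and emits characters."""
    def __init__(self):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, MAX_LEN, D))         # parallel character queries
        layer = nn.TransformerDecoderLayer(D, nhead=8, dim_feedforward=4 * D, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=4)
        self.cls = nn.Linear(D, VOCAB)

    def forward(self, memory):                                        # memory: (B, N, D)
        q = self.query.expand(memory.size(0), -1, -1)
        return self.cls(self.blocks(q, memory))                       # (B, MAX_LEN, VOCAB)

def stage2_decoder_loss(encoder, decoder, images, labels, mask_ratio=0.15):
    """Stage (ii): supervised pretraining on synthesized text images.
    Some patches are masked at the encoder input (random patches here, as an
    approximation of character-occupied patches), so the decoder must rely on
    language context to recover the masked characters."""
    tokens = encoder.patch_embed(images).flatten(2).transpose(1, 2)
    mask = torch.rand(tokens.shape[:2], device=images.device) < mask_ratio
    tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    feats = encoder.blocks(tokens + encoder.pos)
    logits = decoder(feats)                                           # (B, MAX_LEN, VOCAB)
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))

if __name__ == "__main__":
    enc, dec = Encoder(), TextDecoder()
    pixel_head = nn.Linear(D, 32 * PATCH * 3)                         # pixel-regression head for stage (i)
    imgs = torch.randn(2, 3, 32, 128)                                 # dummy batch of text-line images
    labels = torch.randint(0, VOCAB, (2, MAX_LEN))                    # dummy character labels
    print(stage1_mim_loss(enc, pixel_head, imgs).item(),
          stage2_decoder_loss(enc, dec, imgs, labels).item())
```

The decoder here uses parallel learned queries rather than autoregressive decoding; that choice, like the masking ratios, is only one plausible instantiation of the encoder-decoder transformer described above.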