Existing text recognition methods typically require large-scale training data. Most of them rely on synthetic training data due to the lack of annotated real images. However, there is a domain gap between synthetic and real data, which limits the performance of text recognition models. Recent self-supervised text recognition methods have attempted to utilize unlabeled real images by introducing contrastive learning, which mainly learns to discriminate between text images. Inspired by the observation that humans learn to recognize text through both reading and writing, we propose to learn discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method. The contrastive learning branch is adopted to learn the discrimination of text images, imitating the reading behavior of humans. Meanwhile, masked image modeling is introduced to text recognition for the first time to learn the context generation of text images, which resembles the writing behavior. The experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with a similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with notable performance gains.
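To make the dual-branch objective concrete, below is a minimal sketch of how a contrastive ("reading") loss and a masked-image-modeling ("writing") loss could be combined during pre-training. This is an illustrative assumption, not the authors' implementation: the function names, tensor shapes, temperature, and the weighting factor `lam` are all hypothetical.

```python
# Illustrative sketch only. Assumes an encoder has already produced:
#   z1, z2:          (B, D) projected features of two augmented views
#   pred, target:    (B, N, P) predicted / ground-truth patch pixels
#   mask:            (B, N) binary mask, 1 where a patch was masked
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE) loss: discriminate matching view pairs."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, targets)

def mim_loss(pred, target, mask):
    """Masked-image-modeling loss: reconstruct only the masked patches."""
    per_patch = (pred - target).pow(2).mean(dim=-1)      # (B, N) MSE
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def combined_loss(z1, z2, pred, target, mask, lam=0.1):
    """Joint objective: discrimination (reading) + generation (writing).
    The weight lam is an assumed hyperparameter, not from the paper."""
    return info_nce(z1, z2) + lam * mim_loss(pred, target, mask)
```

The key design point the sketch illustrates is that both losses share the same encoder, so the learned representation is shaped simultaneously by instance discrimination and by context reconstruction.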