Existing text recognition methods usually require large-scale training data. Most of them rely on synthetic training data due to the lack of annotated real images. However, there is a domain gap between synthetic and real data, which limits the performance of text recognition models. Recent self-supervised text recognition methods attempt to utilize unlabeled real images by introducing contrastive learning, which mainly learns the discrimination of text images. Inspired by the observation that humans learn to recognize text through both reading and writing, we propose to learn discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method. The contrastive learning branch learns the discrimination of text images, imitating the reading behavior of humans. Meanwhile, masked image modeling is introduced for text recognition for the first time to learn the context generation of text images, which is similar to the writing behavior. The experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with a similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with notable performance gains. The code is available at https://github.com/ayumiymk/DiG.
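To make the two-branch idea concrete, the following is a minimal, illustrative sketch of how a contrastive (discriminative, "reading") loss and a masked-image-modeling (generative, "writing") loss can be combined over a shared encoder. It is not the authors' DiG implementation: the module names, projection/decoder heads, loss weighting `lam`, and the assumption that the encoder returns per-patch features of shape [batch, patches, dim] are all hypothetical placeholders introduced here for illustration.

```python
# Illustrative sketch only (not the official DiG code): combining a contrastive
# InfoNCE loss with a masked-patch reconstruction loss on a shared encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReadWritePretraining(nn.Module):
    """Hypothetical wrapper: discriminative (contrastive) + generative (MIM) objectives."""

    def __init__(self, encoder: nn.Module, feat_dim: int = 256, patch_dim: int = 768):
        super().__init__()
        self.encoder = encoder                          # shared backbone, assumed to output [B, N, feat_dim]
        self.proj = nn.Linear(feat_dim, feat_dim)       # projection head for the contrastive branch
        self.decoder = nn.Linear(feat_dim, patch_dim)   # lightweight decoder for the MIM branch

    def contrastive_loss(self, z1, z2, temperature: float = 0.07):
        # InfoNCE between two augmented views of the same text images ("reading").
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)

    def mim_loss(self, pred_patches, target_patches, mask):
        # Reconstruct masked patches only ("writing"): MSE averaged over masked positions.
        per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)
        return (per_patch * mask).sum() / mask.sum().clamp(min=1)

    def forward(self, view1, view2, masked_view, target_patches, mask, lam: float = 1.0):
        z1 = self.proj(self.encoder(view1).mean(dim=1))    # pooled feature of view 1
        z2 = self.proj(self.encoder(view2).mean(dim=1))    # pooled feature of view 2
        pred = self.decoder(self.encoder(masked_view))     # per-patch reconstruction from masked input
        return self.contrastive_loss(z1, z2) + lam * self.mim_loss(pred, target_patches, mask)
```

In this sketch both objectives share one encoder, so pre-training jointly optimizes discrimination across text images and reconstruction of their masked content; the actual branch architectures and loss weighting used in the paper are described in its method section.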