Until recently, the number of public real-world text images was insufficient for training scene text recognizers. Therefore, most modern training methods rely on synthetic data and operate in a fully supervised manner. Nevertheless, the number of public real-world text images has increased significantly of late, including a great deal of unlabeled data. Leveraging these resources requires semi-supervised approaches; however, the few existing methods do not account for the vision-language multimodal structure and are therefore suboptimal for state-of-the-art multimodal architectures. To bridge this gap, we present semi-supervised learning for multimodal text recognizers (SemiMTR) that leverages unlabeled data at each modality's training phase. Notably, our method refrains from extra training stages and maintains the current three-stage multimodal training procedure. Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training. More specifically, we extend an existing visual representation learning algorithm and propose the first contrastive-based method for scene text recognition. After pretraining the language model on a text corpus, we fine-tune the entire network via sequential, character-level consistency regularization between weakly and strongly augmented views of text images. In a novel setup, consistency is enforced on each modality separately. Extensive experiments validate that our method outperforms the current training schemes and achieves state-of-the-art results on multiple scene text recognition benchmarks.
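To make the fine-tuning stage concrete, the following is a minimal PyTorch sketch of character-level consistency regularization between weakly and strongly augmented views, with consistency enforced per modality as the abstract describes. All names here (`recognizer`, `ema_teacher`, `weak_aug`, `strong_aug`, the dict-of-modalities output format, and the confidence threshold) are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def consistency_loss(student_logits, teacher_logits, conf_threshold=0.5):
    """Character-level consistency: per-character predictions from the
    teacher on the weak view supervise the student on the strong view.

    Both tensors are assumed to have shape (batch, seq_len, num_classes).
    """
    with torch.no_grad():
        probs = teacher_logits.softmax(dim=-1)   # (B, T, C)
        conf, pseudo = probs.max(dim=-1)         # per-character pseudo-labels
        mask = conf.ge(conf_threshold)           # keep confident characters only
    loss = F.cross_entropy(
        student_logits.flatten(0, 1),            # (B*T, C)
        pseudo.flatten(),                        # (B*T,)
        reduction="none",
    )
    return (loss * mask.flatten()).sum() / mask.sum().clamp(min=1)

def semi_supervised_step(recognizer, ema_teacher, images, weak_aug, strong_aug):
    """Hypothetical training step on an unlabeled batch. Both models are
    assumed to return a dict mapping each modality (e.g. 'vision',
    'fusion') to character logits, so consistency is applied to each
    modality separately."""
    weak_out = ema_teacher(weak_aug(images))
    strong_out = recognizer(strong_aug(images))
    return sum(consistency_loss(strong_out[m], weak_out[m]) for m in strong_out)
```

In this sketch the teacher is a momentum (EMA) copy of the recognizer, a common choice in consistency-based semi-supervised learning; the paper's exact teacher/student arrangement and augmentation policies may differ.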