Scene text recognition (STR) enables computers to recognize and read text in various real-world scenes. Recent STR models benefit from taking linguistic information into consideration in addition to visual cues. We propose a novel Masked Vision-Language Transformer (MVLT) to capture both explicit and implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune our model and adopt an iterative correction method to improve performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
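As a toy illustration only (not the authors' implementation), the masking strategy underlying such pretraining can be sketched in plain Python: a random fraction of tokens (image patches or character labels) is replaced with a mask symbol, and the masked positions are recorded so a reconstruction loss can be computed on them. The function name, mask ratio, and mask symbol below are illustrative assumptions.

```python
import random

MASK = "<mask>"  # illustrative mask symbol, not from the paper

def mask_tokens(tokens, ratio, rng):
    """Replace a random fraction of tokens with MASK.

    Returns the masked sequence and the masked positions,
    which a masked-prediction loss would be computed over.
    """
    n = max(1, int(len(tokens) * ratio))       # mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions

# Example: mask ~30% of the character tokens of a word label.
rng = random.Random(0)
masked, positions = mask_tokens(list("STREET"), 0.3, rng)
```

During pretraining, the model would be asked to reconstruct the original tokens at `positions`; the same re-masking idea can drive iterative correction at inference by re-masking low-confidence predictions.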