Understanding the scene is often essential for reading text in real-world scenarios. However, current scene text recognizers operate on cropped text images, unaware of the bigger picture. In this work, we harness the representational power of recent vision-language models, such as CLIP, to provide the crop-based recognizer with scene-level, whole-image information. Specifically, we obtain a rich representation of the entire image and fuse it with the recognizer's word-level features via cross-attention. Moreover, we introduce a gated mechanism that gradually shifts toward the context-enriched representation, enabling simple fine-tuning of a pretrained recognizer. We implement our model-agnostic framework, named CLIPTER (CLIP Text Recognition), on several leading text recognizers and demonstrate consistent performance gains, achieving state-of-the-art results on multiple benchmarks. Furthermore, an in-depth analysis reveals improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.
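To make the described fusion concrete, below is a minimal PyTorch sketch of one plausible realization of the gated cross-attention described above: word-level recognizer features attend to a projected image-level CLIP representation, and a gate initialized at zero lets training start from the pretrained recognizer and gradually shift toward the context-enriched output. All module and parameter names (GatedCrossAttentionFusion, clip_proj, gate) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedCrossAttentionFusion(nn.Module):
    """Illustrative sketch (not the authors' code): fuses a recognizer's
    word-level features with image-level CLIP features via cross-attention,
    then blends the result back through a learnable gate."""

    def __init__(self, d_recognizer: int, d_clip: int, num_heads: int = 8):
        super().__init__()
        # Project the CLIP image representation into the recognizer's feature space.
        self.clip_proj = nn.Linear(d_clip, d_recognizer)
        # Word-level features act as queries; CLIP features act as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_recognizer, num_heads, batch_first=True)
        # Gate initialized to zero: at the start of fine-tuning the module is an
        # identity, and training gradually shifts toward the enriched representation.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, word_feats: torch.Tensor, clip_feats: torch.Tensor) -> torch.Tensor:
        # word_feats: (B, T, d_recognizer) crop-level features from the recognizer.
        # clip_feats: (B, N, d_clip) image-level tokens (N=1 for a single pooled embedding).
        context = self.clip_proj(clip_feats)
        enriched, _ = self.cross_attn(query=word_feats, key=context, value=context)
        # tanh-gated residual blend; with gate == 0 the output equals word_feats.
        return word_feats + torch.tanh(self.gate) * enriched
```

Because the gate starts at zero, the module can be inserted into an already trained recognizer without disturbing its behavior, which is what makes plain fine-tuning (rather than retraining from scratch) feasible.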