In recent years, the dominant paradigm for text spotting has been to combine the tasks of text detection and recognition into a single end-to-end framework. Under this paradigm, both tasks are accomplished by operating over a shared global feature map extracted from the input image. A key challenge faced by end-to-end approaches is performance degradation when recognizing text across scale variations (smaller or larger text) and arbitrary word rotation angles. In this work, we address these challenges by proposing a novel global-to-local attention mechanism for text spotting, termed GLASS, which fuses together global and local features. The global features are extracted from the shared backbone, preserving contextual information from the entire image, while the local features are computed individually on resized, high-resolution rotated word crops. The information extracted from the local crops alleviates much of the inherent difficulty with scale and word rotation. We present a performance analysis across scales and angles, highlighting improvements at scale and angle extremities. In addition, we introduce an orientation-aware loss term supervising the detection task, and show its contribution to both detection and recognition performance across all angles. Finally, we show that GLASS is general by incorporating it into other leading text spotting architectures, improving their text spotting performance. Our method achieves state-of-the-art results on multiple benchmarks, including the newly released TextOCR.
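The fusion idea described above can be sketched as follows. This is a minimal, hedged illustration of a global-to-local gated fusion, not the paper's exact GLASS module: it assumes a pooled global feature (from the shared backbone) and a pooled local feature (from the rescaled, rotated word crop) per word, and a hypothetical learned projection (random weights here) that produces per-channel gates blending the two streams.

```python
import numpy as np

def global_to_local_fusion(f_global, f_local):
    """Hedged sketch of global-to-local attention fusion.

    f_global: (C,) contextual feature pooled from the shared feature map.
    f_local:  (C,) feature pooled from the high-resolution rotated crop.
    Returns a (C,) fused feature: a per-channel convex combination,
    where a sigmoid gate decides how much each stream contributes.
    """
    concat = np.concatenate([f_global, f_local])            # (2C,)
    # Hypothetical learned projection; fixed random weights for illustration.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((f_global.size, concat.size)) * 0.01
    gate = 1.0 / (1.0 + np.exp(-(W @ concat)))              # sigmoid gate in (0, 1)
    return gate * f_global + (1.0 - gate) * f_local

fused = global_to_local_fusion(np.ones(8), np.zeros(8))
print(fused.shape)  # (8,)
```

Because the gate lies strictly between 0 and 1, the fused feature never discards either stream entirely; in the actual architecture the projection weights would be learned end-to-end together with the detector and recognizer.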