Scene text image super-resolution (STISR) has been regarded as an important pre-processing task for text recognition from low-resolution scene text images. Most recent approaches use the recognizer's feedback as clues to guide super-resolution. However, directly using the recognition clue has two problems: 1) Compatibility. It is in the form of a probability distribution, which has an obvious modal gap with STISR, a pixel-level task. 2) Inaccuracy. It usually contains wrong information, which misleads the main task and degrades super-resolution performance. In this paper, we present a novel method, C3-STISR, that jointly exploits the recognizer's feedback, visual, and linguistic information as clues to guide super-resolution. Here, the visual clue comes from the images of the texts predicted by the recognizer, which are informative and more compatible with the STISR task, while the linguistic clue is generated by a pre-trained character-level language model, which is able to correct the predicted texts. We design effective extraction and fusion mechanisms for the triple cross-modal clues to generate comprehensive and unified guidance for super-resolution. Extensive experiments on TextZoom show that C3-STISR outperforms the SOTA methods in fidelity and recognition performance. Code is available at https://github.com/zhaominyiz/C3-STISR.
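To make the triple-clue idea concrete, below is a minimal sketch of how recognizer feedback (a per-step probability distribution), a visual clue (an image rendered from the predicted text), and a linguistic clue (hidden states from a character-level language model) could be projected into a shared space and fused into a single guidance map for the super-resolution backbone. All module names, tensor shapes, and the mean-pooling/broadcast fusion are illustrative assumptions, not the authors' implementation; see the linked repository for the actual method.

```python
import torch
import torch.nn as nn

class TripleClueFusion(nn.Module):
    """Hypothetical fusion of recognizer-feedback, visual, and linguistic
    clues into one guidance tensor (a sketch, not the paper's code)."""

    def __init__(self, num_classes=37, d=64, height=16, width=64, lm_dim=768):
        super().__init__()
        self.height, self.width = height, width
        # Recognizer feedback: per-step class probability distributions.
        self.feedback_proj = nn.Linear(num_classes, d)
        # Visual clue: a grayscale image rendered from the predicted text.
        self.visual_enc = nn.Conv2d(1, d, kernel_size=3, padding=1)
        # Linguistic clue: hidden states of a character-level language model
        # (lm_dim is an assumed hidden size).
        self.lm_proj = nn.Linear(lm_dim, d)
        # 1x1 convolution to unify the three clues into one guidance map.
        self.fuse = nn.Conv2d(3 * d, d, kernel_size=1)

    def forward(self, probs, rendered_text, lm_hidden):
        # probs: (B, T, num_classes), rendered_text: (B, 1, H, W),
        # lm_hidden: (B, T, lm_dim)
        f = self.feedback_proj(probs).mean(dim=1)        # (B, d)
        f = f[:, :, None, None].expand(-1, -1, self.height, self.width)
        v = self.visual_enc(rendered_text)               # (B, d, H, W)
        l = self.lm_proj(lm_hidden).mean(dim=1)          # (B, d)
        l = l[:, :, None, None].expand(-1, -1, self.height, self.width)
        # Concatenate along channels and compress to a unified guidance map.
        return self.fuse(torch.cat([f, v, l], dim=1))    # (B, d, H, W)

# Usage sketch: the resulting guidance tensor would then condition the
# super-resolution network, e.g. by concatenation with image features.
fusion = TripleClueFusion()
guidance = fusion(torch.softmax(torch.randn(2, 26, 37), dim=-1),
                  torch.randn(2, 1, 16, 64),
                  torch.randn(2, 26, 768))
print(guidance.shape)  # torch.Size([2, 64, 16, 64])
```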