Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

Current generative super-resolution methods perform strongly on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce \textbf{TIGER} (\textbf{T}ext-\textbf{I}mage \textbf{G}uided sup\textbf{E}r-\textbf{R}esolution), a novel two-stage framework that breaks this trade-off through a \textit{"text-first, image-later"} paradigm. \textbf{TIGER} explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and then uses them to guide the subsequent full-image super-resolution. This glyph-to-image guidance ensures both high fidelity and visual consistency. To support comprehensive training and evaluation, we also contribute \textbf{UltraZoom-ST} (UltraZoom-Scene Text), the first scene text dataset with an extreme zoom factor (\textbf{$\times$14.29}). Extensive experiments show that \textbf{TIGER} achieves \textbf{state-of-the-art} performance, enhancing readability while preserving overall image quality.
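
The "text-first, image-later" data flow can be illustrated with a toy sketch. This is not the TIGER implementation: every function name here (`restore_glyphs`, `enhance_image`, `two_stage_sr`) is hypothetical, and both stages use trivial placeholders (thresholding and nearest-neighbour upsampling) where the paper uses learned models. It only shows how a glyph map produced in the first stage might condition the second.

```python
from typing import List

# Grayscale image as rows of intensities in [0, 1] (assumption for this sketch).
Image = List[List[float]]


def upscale_nearest(img: Image, scale: int) -> Image:
    """Nearest-neighbour upsampling; a placeholder for a learned upsampler."""
    return [
        [row[x // scale] for x in range(len(row) * scale)]
        for row in img
        for _ in range(scale)
    ]


def restore_glyphs(lr: Image, scale: int, threshold: float = 0.5) -> Image:
    """Stage 1 (hypothetical): build a high-resolution binary glyph map.

    Dark pixels are treated as text strokes (1.0), everything else as
    background (0.0). A real system would use a dedicated restoration model.
    """
    up = upscale_nearest(lr, scale)
    return [[1.0 if v < threshold else 0.0 for v in row] for row in up]


def enhance_image(lr: Image, glyphs: Image, scale: int,
                  strength: float = 0.8) -> Image:
    """Stage 2 (hypothetical): upsample the full image, guided by the glyphs.

    The guidance here is a simple darkening of pixels that lie on a glyph
    stroke; it stands in for conditioning a generative model on the glyph map.
    """
    up = upscale_nearest(lr, scale)
    return [
        [v * (1.0 - strength * g) for v, g in zip(row, glyph_row)]
        for row, glyph_row in zip(up, glyphs)
    ]


def two_stage_sr(lr: Image, scale: int) -> Image:
    """Run glyph restoration first, then glyph-guided image enhancement."""
    glyphs = restore_glyphs(lr, scale)
    return enhance_image(lr, glyphs, scale)


if __name__ == "__main__":
    low_res = [[0.9, 0.2],
               [0.9, 0.9]]
    for row in two_stage_sr(low_res, scale=2):
        print(row)
```

The point of the split is that the second stage never has to infer stroke structure on its own; it receives it as an explicit input, which mirrors the decoupling the abstract describes.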
