GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?

Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion metrics (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ scores. GLYPH-SR is designed to satisfy both objectives simultaneously (high readability and high visual realism), delivering SR that looks right and reads right.
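
The alternation performed by the ping-pong scheduler can be illustrated with a minimal sketch. The abstract does not specify how the alternation is parameterized, so everything here is an assumption made for illustration: the function name `ping_pong_schedule`, the `period` and `start` parameters, and the strict fixed-period switching are all hypothetical and not taken from the paper.

```python
from typing import List


def ping_pong_schedule(num_steps: int, period: int = 1, start: str = "text") -> List[str]:
    """Assign a guidance mode ("text" or "scene") to each denoising step.

    Hypothetical illustration only: the mode flips every `period` steps,
    beginning with `start`. The actual GLYPH-SR schedule may differ.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if period < 1:
        raise ValueError("period must be at least 1")
    if start not in ("text", "scene"):
        raise ValueError("start must be 'text' or 'scene'")

    # Order the two modes so that index 0 is the starting mode.
    modes = ("text", "scene") if start == "text" else ("scene", "text")

    # Integer division groups steps into blocks of length `period`;
    # the parity of the block index selects the mode.
    return [modes[(step // period) % 2] for step in range(num_steps)]


if __name__ == "__main__":
    # Print which kind of guidance each step would receive in this toy schedule.
    for step, mode in enumerate(ping_pong_schedule(6, period=2)):
        print(f"step {step}: {mode}-centric guidance")
```

In a real sampler, each label would select which conditioning (text-centric or scene-centric) is passed to the denoiser at that step; that wiring is omitted here because the abstract does not describe it.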
