Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
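To make the iterative control flow concrete, the following is a minimal Python sketch of the loop the abstract describes: the VLM produces initial textual guidance, the TSM spots text from intermediate diffusion features at each denoising step, and the refined guidance conditions the next DiT step. All class names, method signatures, and the guidance representation here are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of a UniT-style iterative restoration loop.
# Every component below is a stub; only the control flow mirrors the abstract.
from dataclasses import dataclass


@dataclass
class Guidance:
    text: str  # textual content the VLM currently believes is in the image


class VLM:
    """Placeholder VLM: extracts and iteratively refines textual guidance."""

    def extract(self, image) -> Guidance:
        return Guidance(text="<initial text hypothesis>")

    def refine(self, guidance: Guidance, ocr_prediction: str) -> Guidance:
        # Reconcile the running hypothesis with the latest intermediate OCR.
        return Guidance(text=ocr_prediction or guidance.text)


class TSM:
    """Placeholder text spotter operating on intermediate diffusion features."""

    def predict(self, features) -> str:
        return "<intermediate OCR prediction>"


class DiT:
    """Placeholder diffusion transformer backbone."""

    def features(self, latent, t):
        return latent  # stand-in for internal features at timestep t

    def denoise_step(self, latent, t, guidance: Guidance):
        return latent  # stand-in for one text-conditioned denoising step


def restore(lq_image, steps: int = 50):
    vlm, tsm, dit = VLM(), TSM(), DiT()
    guidance = vlm.extract(lq_image)  # explicit textual guidance from the VLM
    latent = lq_image                 # stand-in for the initial diffusion latent
    for t in reversed(range(steps)):
        feats = dit.features(latent, t)
        ocr = tsm.predict(feats)               # intermediate OCR at this step
        guidance = vlm.refine(guidance, ocr)   # iteratively refined guidance
        latent = dit.denoise_step(latent, t, guidance)
    return latent
```

The design choice the sketch highlights is the feedback path: because the TSM reads the DiT's own intermediate features, the VLM's guidance is updated as the image sharpens, rather than being fixed from the degraded input alone.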