Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
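To make the iterative control flow concrete, the following is a minimal Python sketch of the loop the abstract describes: the VLM produces initial textual guidance, the TSM spots text from intermediate diffusion features at each denoising step, and the refined guidance conditions the next DiT step. All class names, method signatures, and the guidance representation here are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of a UniT-style iterative restoration loop.
# Every component below is a stub; only the control flow mirrors the abstract.
from dataclasses import dataclass


@dataclass
class Guidance:
    text: str  # textual content the VLM currently believes is in the image


class VLM:
    """Placeholder VLM: extracts and iteratively refines textual guidance."""

    def extract(self, image) -> Guidance:
        return Guidance(text="<initial text hypothesis>")

    def refine(self, guidance: Guidance, ocr_prediction: str) -> Guidance:
        # Reconcile the running hypothesis with the latest intermediate OCR.
        return Guidance(text=ocr_prediction or guidance.text)


class TSM:
    """Placeholder text spotter operating on intermediate diffusion features."""

    def predict(self, features) -> str:
        return "<intermediate OCR prediction>"


class DiT:
    """Placeholder diffusion transformer backbone."""

    def features(self, latent, t):
        return latent  # stand-in for internal features at timestep t

    def denoise_step(self, latent, t, guidance: Guidance):
        return latent  # stand-in for one text-conditioned denoising step


def restore(lq_image, steps: int = 50):
    vlm, tsm, dit = VLM(), TSM(), DiT()
    guidance = vlm.extract(lq_image)  # explicit textual guidance from the VLM
    latent = lq_image                 # stand-in for the initial diffusion latent
    for t in reversed(range(steps)):
        feats = dit.features(latent, t)
        ocr = tsm.predict(feats)               # intermediate OCR at this step
        guidance = vlm.refine(guidance, ocr)   # iteratively refined guidance
        latent = dit.denoise_step(latent, t, guidance)
    return latent
```

The design choice the sketch highlights is the feedback path: because the TSM reads the DiT's own intermediate features, the VLM's guidance is updated as the image sharpens, rather than being fixed from the degraded input alone.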