Recently, image restoration transformers have achieved performance comparable to that of previous state-of-the-art CNNs. However, how to leverage such architectures efficiently remains an open problem. In this work, we present Dual-former, whose core insight is to combine the powerful global modeling ability of self-attention modules with the local modeling ability of convolutions in a single overall architecture. With convolution-based Local Feature Extraction modules in the encoder and decoder, we adopt a novel Hybrid Transformer Block only in the latent layer to model long-range dependencies in the spatial dimension and to handle the uneven distribution across channels. This design avoids the substantial computational cost of previous image restoration transformers while achieving superior performance on multiple image restoration tasks. Experiments demonstrate that Dual-former achieves a 1.91 dB gain over the state-of-the-art MAXIM method on the Indoor dataset for single image dehazing while consuming only 4.2% of MAXIM's GFLOPs. For single image deraining, it exceeds the SOTA method by 0.1 dB PSNR averaged over five datasets with only 21.5% of the GFLOPs. Dual-former also substantially surpasses the latest desnowing method on various datasets, with fewer parameters.
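To make the overall design concrete, below is a minimal PyTorch sketch of the hybrid layout described above: convolutional modules in the encoder and decoder, with attention applied only at the low-resolution latent layer. All module internals here (the LocalFeatureExtraction body, the Hybrid Transformer Block's spatial attention plus channel gate, and the dimensions) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class LocalFeatureExtraction(nn.Module):
    """Hypothetical stand-in for the conv-based LFE module:
    depthwise + pointwise convolutions with a residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),                         # pointwise
        )

    def forward(self, x):
        return x + self.body(x)


class HybridTransformerBlock(nn.Module):
    """Hypothetical stand-in for the Hybrid Transformer Block:
    spatial self-attention for long-range dependencies, followed by a
    squeeze-and-excitation-style gate over the channel dimension."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, dim), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        t = self.norm(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]
        gate = self.channel_gate(tokens.mean(dim=1))   # (B, C) per-channel weights
        tokens = tokens * gate.unsqueeze(1)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class DualFormerSketch(nn.Module):
    """U-shaped skeleton: convolutions at full resolution, attention
    only at the downsampled latent layer, which is what keeps the
    quadratic cost of self-attention off the large feature maps."""

    def __init__(self, in_ch=3, dim=32):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, dim, 3, padding=1)
        self.enc = LocalFeatureExtraction(dim)
        self.down = nn.Conv2d(dim, dim * 2, 4, stride=2, padding=1)
        self.latent = HybridTransformerBlock(dim * 2)
        self.up = nn.ConvTranspose2d(dim * 2, dim, 4, stride=2, padding=1)
        self.dec = LocalFeatureExtraction(dim)
        self.out = nn.Conv2d(dim, in_ch, 3, padding=1)

    def forward(self, x):
        e = self.enc(self.stem(x))
        z = self.latent(self.down(e))
        d = self.dec(self.up(z) + e)               # encoder-decoder skip
        return x + self.out(d)                     # residual restoration

if __name__ == "__main__":
    net = DualFormerSketch()
    print(net(torch.randn(1, 3, 64, 64)).shape)    # torch.Size([1, 3, 64, 64])
```

Because self-attention runs only on the downsampled latent features, its quadratic cost in the number of tokens applies to a quarter of the spatial positions in this sketch, which is the kind of saving behind the reported 4.2% and 21.5% GFLOPs figures.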