Transformers have shown significant effectiveness on a wide range of vision tasks, covering both high-level and low-level vision. Recently, masked autoencoder (MAE) pre-training has further unleashed the potential of Transformers, leading to state-of-the-art performance on various high-level vision tasks. However, the benefit of MAE pre-training for low-level vision tasks has not been sufficiently explored. In this paper, we show that masked autoencoders are also scalable self-supervised learners for image processing tasks. We first present CSformer, an efficient Transformer model that combines channel attention with shifted-window-based self-attention. We then develop an effective masked autoencoder architecture for image processing (MAEIP) tasks. Extensive experimental results show that, with MAEIP pre-training, the proposed CSformer achieves state-of-the-art performance on various image processing tasks, including Gaussian denoising, real image denoising, single-image motion deblurring, defocus deblurring, and image deraining.
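The abstract names the two attention mechanisms at the core of CSformer: channel attention and shifted-window-based self-attention. The sketch below is a minimal, illustrative PyTorch rendering of those two ingredients only, not the authors' implementation; the module names, window size, head count, and SE-style channel attention are assumptions, and the window shift between successive blocks is omitted for brevity.

```python
# Illustrative sketch of channel attention + window-based self-attention.
# All names and hyperparameters here are assumptions, not the CSformer code.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        return x * self.mlp(self.pool(x))      # reweight channels


class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows (illustrative).
    Shifting the windows in alternating blocks, as in Swin, would add
    cross-window interaction; the shift is not shown here."""
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        ws = self.window_size
        # partition into (ws x ws) windows -> (B * num_windows, ws*ws, C)
        x = x.view(B, C, H // ws, ws, W // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)
        # reverse the window partition back to (B, C, H, W)
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x


if __name__ == "__main__":
    feat = torch.randn(1, 32, 64, 64)
    feat = WindowSelfAttention(32)(feat)
    feat = ChannelAttention(32)(feat)
    print(feat.shape)  # torch.Size([1, 32, 64, 64])
```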