Single image super-resolution (SISR) has witnessed great strides with the development of deep learning. However, most existing studies focus on building more complex networks with a massive number of layers. Recently, more and more researchers have started to explore the application of Transformers in computer vision tasks. However, the heavy computational cost and high GPU memory occupation of vision Transformers cannot be ignored. In this paper, we propose a novel Efficient Super-Resolution Transformer (ESRT) for SISR. ESRT is a hybrid model consisting of a Lightweight CNN Backbone (LCB) and a Lightweight Transformer Backbone (LTB). Specifically, LCB can dynamically adjust the size of the feature map to extract deep features at a low computational cost. LTB is composed of a series of Efficient Transformers (ET), which occupy only a small amount of GPU memory thanks to the specially designed Efficient Multi-Head Attention (EMHA). Extensive experiments show that ESRT achieves competitive results at low computational cost. Compared with the original Transformer, which occupies 16,057M of GPU memory, ESRT occupies only 4,191M. All codes are available at https://github.com/luissen/ESRT.
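A minimal NumPy sketch of the memory-saving idea behind segment-wise attention: instead of forming the full n x n score matrix, the sequence is split into segments and attention is computed per segment, shrinking peak memory roughly by the number of splits. The function names, segment count, and divisibility assumption here are illustrative simplifications, not the paper's exact EMHA formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(q, k, v, num_splits=4):
    """Single-head attention computed over sequence segments.

    q, k, v: (n, d) arrays; n is assumed divisible by num_splits
    (a simplifying assumption of this sketch). Each segment forms
    a (seg, seg) score matrix instead of the full (n, n) one, so
    peak memory drops roughly by a factor of num_splits.
    """
    n, d = q.shape
    seg = n // num_splits
    out = np.empty_like(q)
    for i in range(num_splits):
        sl = slice(i * seg, (i + 1) * seg)
        scores = q[sl] @ k[sl].T / np.sqrt(d)  # (seg, seg) scores
        out[sl] = softmax(scores) @ v[sl]
    return out
```

With `num_splits=1` this reduces to ordinary scaled dot-product attention, so the split factor trades a small amount of cross-segment interaction for memory savings.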