This paper presents a new hierarchical vision Transformer for image style transfer, called Strips Window Attention Transformer (S2WAT), which serves as the encoder in an encoder-transfer-decoder architecture. Because it produces hierarchical features, S2WAT allows future work to bring proven techniques from other areas of computer vision, such as feature pyramid networks (FPN) or U-Net, to image style transfer. However, existing window-based Transformers produce grid-like artifacts in the stylized images when applied directly to image style transfer. To solve this problem, we propose S2WAT, whose representation is computed with Strips Window Attention (SpW Attention). SpW Attention integrates both local information and long-range dependencies in the horizontal and vertical directions through a novel feature fusion scheme named Attn Merge. Qualitative and quantitative experiments demonstrate that S2WAT achieves performance comparable to state-of-the-art CNN-based, Flow-based, and Transformer-based approaches. The code and models are available at https://github.com/AlienZhang1996/S2WAT.
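To make the idea of attending within strip-shaped windows and then fusing the results more concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of a square window as the "local" branch, the absence of learned query/key/value projections, and the similarity-weighted fusion standing in for Attn Merge are all hypothetical simplifications. The released repository is the reference for the actual method.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(x, wh, ww):
    """Split an (H, W, C) map into non-overlapping (wh, ww) windows.

    Returns an array of shape (num_windows, wh * ww, C).
    """
    H, W, C = x.shape
    assert H % wh == 0 and W % ww == 0, "window must tile the feature map"
    x = x.reshape(H // wh, wh, W // ww, ww, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, wh * ww, C)


def window_reverse(windows, wh, ww, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) map."""
    C = windows.shape[-1]
    x = windows.reshape(H // wh, W // ww, wh, ww, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)


def window_self_attention(x, wh, ww):
    """Self-attention restricted to each (wh, ww) window.

    Assumption: queries, keys and values are the raw tokens (no learned
    projections, single head), which keeps the sketch short.
    """
    H, W, C = x.shape
    tokens = window_partition(x, wh, ww)                      # (N, T, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)  # (N, T, T)
    out = softmax(scores, axis=-1) @ tokens                   # (N, T, C)
    return window_reverse(out, wh, ww, H, W)


def strips_window_attention(x, n):
    """Hypothetical strip-window attention with a simple fusion step.

    Three branches are computed: horizontal strips (n x W), vertical
    strips (H x n), and square windows (n x n). The fusion below weights
    each branch per position by its similarity to the input token; this
    is an assumed stand-in for Attn Merge, not the published definition.
    """
    H, W, C = x.shape
    horizontal = window_self_attention(x, n, W)   # long-range along rows
    vertical = window_self_attention(x, H, n)     # long-range along columns
    square = window_self_attention(x, n, n)       # local neighbourhood
    branches = np.stack([horizontal, vertical, square], axis=-2)  # (H, W, 3, C)
    similarity = np.einsum("hwc,hwkc->hwk", x, branches) / np.sqrt(C)
    weights = softmax(similarity, axis=-1)                        # (H, W, 3)
    return np.einsum("hwk,hwkc->hwc", weights, branches)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((8, 8, 4))
    fused = strips_window_attention(features, n=2)
    print(fused.shape)
```

In this sketch each position sees a full row band and a full column band in addition to its local window, so information can cross the window boundaries that a purely square partition would impose; this is the intuition behind avoiding grid-like artifacts, although the sketch does not attempt to reproduce that result.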