In this paper, we redesign the vision Transformer (ViT) as a new backbone for semantic image transmission, yielding a scheme we term the wireless image transmission transformer (WITT). Previous works build upon convolutional neural networks (CNNs), which are inefficient at capturing global dependencies, resulting in degraded end-to-end transmission performance, especially for high-resolution images. To tackle this, the proposed WITT employs Swin Transformers as a more capable backbone to extract long-range information. Unlike ViTs used for image classification, WITT is highly optimized for image transmission and explicitly accounts for the effect of the wireless channel. Specifically, we propose a spatial modulation module that scales the latent representations according to channel state information, which enhances the ability of a single model to handle various channel conditions. Extensive experiments verify that WITT attains better performance across different image resolutions, distortion metrics, and channel conditions. The code is available at https://github.com/KeYang8/WITT.
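To make the spatial modulation idea concrete, the following is a minimal PyTorch sketch, not the paper's actual module: it assumes the channel state information is a scalar SNR in dB and uses a small hypothetical MLP (`SpatialModulation`) to map that SNR to a per-channel scaling vector applied to the latent features. The real WITT design may differ in architecture and conditioning details; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialModulation(nn.Module):
    """Hypothetical sketch: scale latent features according to channel
    state information (here reduced to a scalar SNR in dB), so that a
    single model can adapt to varying channel conditions."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        # Small MLP mapping the scalar SNR to a per-channel scaling vector.
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),  # keep scales in (0, 1)
        )

    def forward(self, z: torch.Tensor, snr_db: torch.Tensor) -> torch.Tensor:
        # z: latent features of shape (B, C, H, W); snr_db: shape (B, 1)
        scale = self.mlp(snr_db)             # (B, C)
        return z * scale[:, :, None, None]   # broadcast over spatial dims


# Usage: modulate a batch of latents for transmission at 10 dB SNR.
z = torch.randn(4, 96, 16, 16)
snr = torch.full((4, 1), 10.0)
z_mod = SpatialModulation(channels=96)(z, snr)
print(z_mod.shape)  # torch.Size([4, 96, 16, 16])
```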