While attention-based transformer networks achieve unparalleled success in nearly all language tasks, the large number of tokens coupled with the quadratic activation memory usage makes them prohibitive for visual tasks. As such, while language-to-language translation has been revolutionized by the transformer model, convolutional networks remain the de facto solution for image-to-image translation. The recently proposed MLP-Mixer architecture alleviates some of the speed and memory issues associated with attention-based networks while still retaining the long-range connections that make transformer models desirable. Leveraging this efficient alternative to self-attention, we propose a new unpaired image-to-image translation model called MixerGAN: a simpler MLP-based architecture that considers long-distance relationships between pixels without the need for expensive attention mechanisms. Quantitative and qualitative analysis shows that MixerGAN achieves competitive results when compared to prior convolution-based methods.
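To make the attention-free mixing idea concrete, the following is a minimal NumPy sketch of one MLP-Mixer block of the kind the abstract refers to: a token-mixing MLP applied across the token axis followed by a channel-mixing MLP, each with layer normalization and a residual connection. This is an illustrative assumption about the general MLP-Mixer design (using ReLU in place of GELU for brevity), not the MixerGAN implementation itself; all function and variable names here are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each row over the last axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    # Two-layer MLP; ReLU stands in for GELU to keep the sketch short.
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    # x: (num_tokens, channels).
    # Token mixing shares one MLP across channels and acts along the
    # token axis (hence the transposes); channel mixing shares one MLP
    # across tokens and acts along the channel axis.
    y = x + mlp(layer_norm(x).T, tok_w1, tok_w2).T   # token mixing
    y = y + mlp(layer_norm(y), ch_w1, ch_w2)         # channel mixing
    return y

rng = np.random.default_rng(0)
T, C, H = 16, 8, 32  # tokens, channels, hidden width (illustrative sizes)
x = rng.standard_normal((T, C))
out = mixer_block(
    x,
    rng.standard_normal((T, H)) * 0.1, rng.standard_normal((H, T)) * 0.1,
    rng.standard_normal((C, H)) * 0.1, rng.standard_normal((H, C)) * 0.1,
)
print(out.shape)  # (16, 8)
```

Note that the token-mixing weights have shape (tokens, hidden) rather than the (tokens, tokens) attention map, so activation memory grows linearly, not quadratically, with the number of tokens, which is the efficiency property the abstract highlights.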