While attention-based transformer networks achieve unparalleled success in nearly all language tasks, the large number of tokens (pixels) in images, coupled with self-attention's quadratically scaling activation memory, makes them prohibitively expensive for problems in computer vision. As such, while language-to-language translation has been revolutionized by the transformer model, convolutional networks remain the de facto solution for image-to-image translation. The recently proposed MLP-Mixer architecture alleviates some of the computational issues associated with attention-based networks while still retaining the long-range connections that make transformer models desirable. Leveraging this memory-efficient alternative to self-attention, we propose MixerGAN, an exploratory new model for unpaired image-to-image translation: a simpler MLP-based architecture that considers long-distance relationships between pixels without the need for expensive attention mechanisms. Quantitative and qualitative analysis shows that MixerGAN achieves competitive results when compared to prior convolution-based methods.
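To make concrete the token-mixing idea the abstract refers to, the following is a minimal sketch of a single MLP-Mixer block in PyTorch. It is an illustration of the general mechanism only, not MixerGAN's actual configuration: the hidden sizes and names (`token_hidden`, `channel_hidden`) are illustrative assumptions. The key point is that both MLPs have activation memory linear in the token count, unlike the quadratic cost of self-attention.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: a token-mixing MLP (shared across channels)
    followed by a channel-mixing MLP (applied per token), each with a
    residual connection and pre-normalization, as in the MLP-Mixer paper.
    Sizes here are illustrative, not MixerGAN's reported settings."""

    def __init__(self, num_tokens, dim, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing MLP: operates along the token axis, letting every
        # token (pixel/patch) interact with every other token.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel-mixing MLP: operates along the channel axis, per token.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x):  # x: (batch, num_tokens, dim)
        # Transpose so the Linear layers mix across tokens, then transpose back.
        y = self.norm1(x).transpose(1, 2)              # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)      # residual token mixing
        x = x + self.channel_mlp(self.norm2(x))        # residual channel mixing
        return x
```

In this sketch, long-range interactions come from the first `nn.Linear` in `token_mlp`, whose weight matrix directly connects all token pairs with fixed (rather than input-dependent) weights, which is what removes the need to materialize a quadratic attention map.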