Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring fewer vision-specific inductive biases. In this paper, we investigate whether this observation extends to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). We observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce novel regularization techniques for training GANs with ViTs. Empirically, our approach, named ViTGAN, achieves performance comparable to the state-of-the-art CNN-based StyleGAN2 on the CIFAR-10, CelebA, and LSUN bedroom datasets.
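To make the architectural idea concrete, the sketch below shows one minimal way a ViT-style discriminator could be plugged into a GAN: the image is split into patches, embedded as tokens, processed by a Transformer encoder, and a real/fake logit is read off a class token. This is an illustrative assumption written with standard PyTorch modules (patch size, width, depth, and the use of `nn.TransformerEncoder` are all hypothetical choices), not the ViTGAN architecture or its stabilizing regularization described in the paper.

```python
# Illustrative sketch only: a minimal ViT-style discriminator for a GAN.
# All hyperparameters here (patch_size, dim, depth, heads) are assumptions
# for exposition, not values from the ViTGAN paper.
import torch
import torch.nn as nn


class ViTDiscriminator(nn.Module):
    def __init__(self, img_size=32, patch_size=4, dim=256, depth=4, heads=4):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Split the image into non-overlapping patches and linearly embed each one.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Real/fake logit is read off the class token.
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # (B, 1) real/fake score


if __name__ == "__main__":
    disc = ViTDiscriminator()
    logits = disc(torch.randn(2, 3, 32, 32))
    print(logits.shape)  # torch.Size([2, 1])
```

As the abstract notes, a plain design like this interacts poorly with existing GAN regularization; the paper's contribution is precisely the regularization needed to train such self-attention-based GANs stably.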