Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) need only a single forward pass. They are thus much faster, but they currently remain far behind the state of the art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and a controllable tradeoff between variation and text alignment. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models, the previous state of the art in fast text-to-image synthesis, in terms of sample quality and speed.