Text-to-image generation in the general domain has long been an open problem, requiring both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks (e.g., style learning, super-resolution, text-image ranking, and fashion design) and methods for stabilizing pretraining (e.g., eliminating NaN losses). CogView achieves a new state-of-the-art FID on the blurred MS COCO dataset in the zero-shot setting, outperforming previous GAN-based models and DALL-E, a recent similar work.
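To make the described pipeline concrete, the sketch below (not the authors' implementation) illustrates the two-stage design the abstract names: a VQ-VAE encoder quantizes an image into a grid of discrete codebook ids, and a GPT-style Transformer autoregressively models the concatenated text-plus-image token sequence. All class names (VQTokenizer, TextToImageLM), dimensions, vocabulary sizes, and layer counts here are illustrative assumptions; CogView itself uses 4 billion parameters and a far larger configuration.

```python
# Minimal sketch of a VQ-VAE tokenizer + autoregressive Transformer for
# text-to-image modeling. Shapes and hyperparameters are toy values.
import torch
import torch.nn as nn

class VQTokenizer(nn.Module):
    """Toy VQ-VAE encoder: maps an image to a grid of discrete codebook ids."""
    def __init__(self, codebook_size=8192, dim=64):
        super().__init__()
        # 8x8 patches: a 256x256 image becomes a 32x32 grid of latents.
        self.encoder = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, images):                      # images: (B, 3, 256, 256)
        z = self.encoder(images)                    # (B, dim, 32, 32)
        z = z.flatten(2).transpose(1, 2)            # (B, 1024, dim)
        # Nearest-neighbor quantization against the codebook.
        book = self.codebook.weight.expand(z.size(0), -1, -1)
        return torch.cdist(z, book).argmin(-1)      # image token ids (B, 1024)

class TextToImageLM(nn.Module):
    """GPT-style decoder over the joint [text tokens; image tokens] sequence."""
    def __init__(self, vocab_size=8192 + 50000, dim=512, n_layers=4, seq_len=1088):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                      # tokens: (B, T)
        T = tokens.size(1)
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=mask)) # next-token logits

# One training step: predict each token given the text and preceding tokens.
tokenizer, lm = VQTokenizer(), TextToImageLM()
text = torch.randint(8192, 58192, (2, 64))          # stand-in text token ids
image_tokens = tokenizer(torch.randn(2, 3, 256, 256))
seq = torch.cat([text, image_tokens], dim=1)        # text first, then image
logits = lm(seq[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
```

At generation time, one would condition on text tokens alone and sample image tokens one at a time, then decode them back to pixels with the VQ-VAE decoder (omitted here).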