Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition, and noisy image patches, as tokens, and by employing long skip connections between shallow and deep layers. We evaluate U-ViT on unconditional and class-conditional image generation, as well as text-to-image generation, where U-ViT is comparable, if not superior, to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256×256 and 5.48 in text-to-image generation on MS-COCO, among methods that do not access large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial, while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large-scale cross-modality datasets.
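To make the two architectural ideas concrete, below is a minimal PyTorch sketch of the U-ViT design described above: the time step, the class condition, and the noisy image patches are all embedded as tokens in one sequence, and each block in the deep half receives a long skip connection from its mirrored shallow block (concatenation followed by a linear projection). All names here are illustrative, not the authors' reference implementation; the scalar time embedding and the absence of the paper's final convolutional layer are simplifications for brevity.

```python
# Minimal U-ViT sketch (assumed PyTorch; illustrative, not the official code).
import torch
import torch.nn as nn

class Block(nn.Module):
    """Standard pre-norm transformer block."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class UViT(nn.Module):
    def __init__(self, img_size=32, patch=2, channels=4, dim=512,
                 depth=12, heads=8, num_classes=1000):
        super().__init__()
        self.patch = patch
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Linear(patch * patch * channels, dim)
        # Simplified scalar time embedding (the paper uses a sinusoidal one).
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                        nn.Linear(dim, dim))
        self.label_embed = nn.Embedding(num_classes, dim)
        # +2 positions for the time token and the condition token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        half = depth // 2
        self.in_blocks = nn.ModuleList(Block(dim, heads) for _ in range(half))
        self.mid_block = Block(dim, heads)
        self.out_blocks = nn.ModuleList(Block(dim, heads) for _ in range(half))
        # Long skips: concat shallow + deep tokens, project back to dim.
        self.skip_proj = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(half))
        self.out = nn.Linear(dim, patch * patch * channels)

    def forward(self, x, t, y):
        B, C, H, W = x.shape
        p = self.patch
        # (B, C, H, W) -> (B, num_patches, p*p*C): patchify the noisy image.
        patches = (x.unfold(2, p, p).unfold(3, p, p)
                    .permute(0, 2, 3, 4, 5, 1).reshape(B, -1, p * p * C))
        tokens = self.patch_embed(patches)
        t_tok = self.time_embed(t.view(B, 1).float()).unsqueeze(1)  # time as a token
        y_tok = self.label_embed(y).unsqueeze(1)                    # condition as a token
        h = torch.cat([t_tok, y_tok, tokens], dim=1) + self.pos_embed
        skips = []
        for blk in self.in_blocks:            # shallow half: remember activations
            h = blk(h)
            skips.append(h)
        h = self.mid_block(h)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            h = proj(torch.cat([h, skips.pop()], dim=-1))  # long skip connection
            h = blk(h)
        # Drop the time/condition tokens; predict noise per image patch.
        return self.out(h[:, 2:])
```

As a usage sketch, `UViT()(torch.randn(4, 4, 32, 32), torch.randint(0, 1000, (4,)), torch.randint(0, 1000, (4,)))` returns per-patch noise predictions of shape `(4, 256, 16)`, which would be un-patchified back to latent-image shape in a full pipeline. Note there is no down-sampling or up-sampling anywhere: the token sequence keeps a constant resolution, and only the long skip connections carry shallow features to the deep layers, matching the abstract's claim about what is and is not essential.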