Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition, and noisy image patches, as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT on unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, a latent diffusion model with a small U-ViT achieves a record-breaking FID of 5.48 in text-to-image generation on MS-COCO, among methods that do not access large external datasets during the training of generative models. In addition, our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in the CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large-scale cross-modality datasets.
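To make the two characteristic design choices concrete, the following is a minimal PyTorch-style sketch of the idea described above, not the authors' implementation: the class and attribute names (UViTSketch, patch_embed, skip_proj, etc.) and all hyperparameters (embedding width, depth, number of heads) are illustrative assumptions, and details such as positional embeddings, the patchify/unpatchify steps, and normalization are omitted. It only shows how the time, condition, and noisy image patches all enter the transformer as tokens, and how long skip connections fuse shallow features into deep layers.

import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    """Illustrative sketch (not the paper's code): every input is a token,
    and shallow features are fed to deep blocks via long skip connections."""

    def __init__(self, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        assert depth % 2 == 0
        half = depth // 2
        self.patch_embed = nn.Linear(dim, dim)              # noisy image patches -> patch tokens
        self.time_embed = nn.Linear(1, dim)                  # scalar timestep -> one token
        self.cond_embed = nn.Embedding(num_classes, dim)     # class condition -> one token
        block = lambda: nn.TransformerEncoderLayer(dim, nhead=heads, batch_first=True)
        self.in_blocks = nn.ModuleList([block() for _ in range(half)])
        self.mid_block = block()
        self.out_blocks = nn.ModuleList([block() for _ in range(half)])
        # long skip connections: concatenate a shallow feature with the deep one, project back
        self.skip_proj = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(half)])
        self.out_proj = nn.Linear(dim, dim)                   # per-patch noise prediction

    def forward(self, patches, t, y):
        # patches: (B, N, dim) flattened noisy patches; t: (B,) timesteps; y: (B,) class labels
        tokens = torch.cat([
            self.time_embed(t[:, None].float())[:, None, :],  # time token
            self.cond_embed(y)[:, None, :],                   # condition token
            self.patch_embed(patches),                        # patch tokens
        ], dim=1)
        skips = []
        x = tokens
        for blk in self.in_blocks:                            # shallow half
            x = blk(x)
            skips.append(x)
        x = self.mid_block(x)
        for blk, proj in zip(self.out_blocks, self.skip_proj):  # deep half
            x = proj(torch.cat([x, skips.pop()], dim=-1))     # long skip connection
            x = blk(x)
        return self.out_proj(x[:, 2:])                        # drop time/condition tokens

Note that, consistent with the abstract's claim, this sketch uses no down-sampling or up-sampling: the token sequence keeps the same length and width throughout, and only the long skip connections are carried over from the U-Net design.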