All are Worth Words: A ViT Backbone for Diffusion Models

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, a latent diffusion model with a small U-ViT achieves a record-breaking FID of 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Besides, our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets.
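
The two design choices named in the abstract, treating time, condition and noisy image patches uniformly as tokens and adding long skip connections between shallow and deep layers, can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names (`patchify`, `build_tokens`, `skip_pairs`, `forward`) are hypothetical, the blocks are arbitrary callables standing in for transformer blocks, and merging a skip branch by element-wise addition is an assumption made only to keep the example short.

```python
# Illustrative sketch only; every name below is hypothetical.

def patchify(image, patch):
    """Split an H x W image (a list of rows) into flattened patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def build_tokens(time_token, cond_token, image, patch):
    """Place time, condition and noisy image patches in one token sequence."""
    return [time_token, cond_token] + patchify(image, patch)


def skip_pairs(depth):
    """Pair shallow block i with deep block depth-1-i.

    With an odd depth the middle block has no partner.
    """
    return [(i, depth - 1 - i) for i in range(depth // 2)]


def forward(tokens, blocks):
    """Run the blocks in order, feeding each deep block its shallow partner."""
    depth = len(blocks)
    source_of = {deep: shallow for shallow, deep in skip_pairs(depth)}
    saved = {}
    h = tokens
    for idx, block in enumerate(blocks):
        if idx in source_of:
            skip = saved[source_of[idx]]
            # Assumption: element-wise addition stands in for a learned merge.
            h = [[a + b for a, b in zip(tok, skip_tok)]
                 for tok, skip_tok in zip(h, skip)]
        h = block(h)
        if idx < depth // 2:
            saved[idx] = h  # keep shallow outputs for the long skips
    return h
```

For example, `build_tokens([0.0] * 4, [0.0] * 4, image, 2)` on a 4 x 4 image yields a sequence of six tokens (one time token, one condition token, four patches), and `skip_pairs(5)` pairs block 0 with block 4 and block 1 with block 3, leaving block 2 as the unpaired middle block.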
