Diffusion models have recently been shown to perform remarkably well on text-to-image synthesis tasks in a number of studies, opening new research opportunities for image generation. Google's Imagen follows this research trend and outperforms DALL-E 2 as the state-of-the-art text-to-image model. However, Imagen relies solely on a T5 language model for text processing, which does not guarantee that the semantic information of the text is fully captured. Furthermore, the Efficient UNet leveraged by Imagen is not the best choice for image processing. To address these issues, we propose Swinv2-Imagen, a novel text-to-image diffusion model based on a hierarchical visual transformer and a scene graph incorporating a semantic layout. In the proposed model, feature vectors of entities and relationships are extracted from the scene graph and fed into the diffusion model, effectively improving the quality of the generated images. On top of that, we introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which mitigates the limitations of CNN convolution operations. Extensive experiments are conducted to evaluate the performance of the proposed model on three real-world datasets, i.e., MSCOCO, CUB, and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.
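To make the conditioning idea concrete, the following is a minimal sketch, not the authors' implementation, of how scene-graph entity and relation embeddings could be fused with T5 text embeddings before reaching the diffusion model's cross-attention layers. All names here (SceneGraphEncoder, fuse_conditioning, d_model, the vocabulary sizes) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SceneGraphEncoder(nn.Module):
    """Embeds scene-graph (subject, relation, object) triples as conditioning tokens.

    Hypothetical module for illustration; the paper's actual graph encoder may differ.
    """

    def __init__(self, num_entities: int, num_relations: int, d_model: int = 512):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, d_model)
        self.relation_emb = nn.Embedding(num_relations, d_model)
        # Project each concatenated triple into a single conditioning token.
        self.triple_proj = nn.Linear(3 * d_model, d_model)

    def forward(self, triples: torch.Tensor) -> torch.Tensor:
        # triples: (batch, n_triples, 3) holding [subject, relation, object] ids
        subj = self.entity_emb(triples[..., 0])
        rel = self.relation_emb(triples[..., 1])
        obj = self.entity_emb(triples[..., 2])
        return self.triple_proj(torch.cat([subj, rel, obj], dim=-1))


def fuse_conditioning(text_tokens: torch.Tensor, graph_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate T5 text tokens and scene-graph tokens along the sequence
    axis so the UNet's cross-attention can attend to both."""
    return torch.cat([text_tokens, graph_tokens], dim=1)


# Usage: 4 captions, 77 T5 tokens each, 8 scene-graph triples each (toy data).
encoder = SceneGraphEncoder(num_entities=1000, num_relations=50, d_model=512)
text_tokens = torch.randn(4, 77, 512)                    # stand-in for frozen T5 output
triples = torch.randint(0, 50, (4, 8, 3))                # toy triple ids
cond = fuse_conditioning(text_tokens, encoder(triples))  # (4, 85, 512)
```

Sequence-axis concatenation is one simple fusion choice; it lets the denoising network weight textual and structural cues jointly through a single cross-attention pass.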
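Similarly, the sketch below illustrates the idea behind Swinv2-Unet: replacing a convolutional UNet block with windowed self-attention, which can model dependencies beyond a fixed CNN kernel's receptive field. This is an illustrative simplification under stated assumptions, not the paper's architecture: window shifting, Swin v2's scaled-cosine attention, and diffusion timestep conditioning are all omitted, and the class name is hypothetical.

```python
import torch
import torch.nn as nn


class WindowAttentionBlock(nn.Module):
    """Transformer block that attends within non-overlapping spatial windows
    (a simplified, Swin-style stand-in for a convolutional UNet block)."""

    def __init__(self, dim: int, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map from a UNet stage;
        # height and width are assumed divisible by the window size.
        b, c, h, w = x.shape
        s = self.window
        # Partition the map into (b * num_windows, s*s, c) token sequences.
        tokens = (x.reshape(b, c, h // s, s, w // s, s)
                   .permute(0, 2, 4, 3, 5, 1)
                   .reshape(-1, s * s, c))
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        # Merge the windows back into a feature map of the original shape.
        return (tokens.reshape(b, h // s, w // s, s, s, c)
                      .permute(0, 5, 1, 3, 2, 4)
                      .reshape(b, c, h, w))


# Usage: a 64x64 feature map with 128 channels, as at a mid-resolution UNet stage.
block = WindowAttentionBlock(dim=128, window=8, heads=4)
out = block(torch.randn(2, 128, 64, 64))  # output shape: (2, 128, 64, 64)
```

Restricting attention to local windows keeps the cost linear in image size, which is what makes transformer blocks practical as drop-in replacements for convolutions inside a UNet.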