Diffusion models have recently been shown to perform remarkably well on text-to-image synthesis tasks in a number of studies, opening new research opportunities for image generation. Google's Imagen follows this research trend and outperforms DALL-E 2 as the state-of-the-art text-to-image model. However, Imagen relies solely on a T5 language model for text processing, which does not guarantee that the semantic information of the text is fully captured. Furthermore, the Efficient UNet leveraged by Imagen is not the best choice for image processing. To address these issues, we propose Swinv2-Imagen, a novel text-to-image diffusion model based on a hierarchical visual transformer and a scene graph incorporating a semantic layout. In the proposed model, feature vectors of entities and relationships are extracted from the scene graph and fed into the diffusion model, effectively improving the quality of the generated images. On top of that, we introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which mitigates the limitations of CNN convolution operations. Extensive experiments are conducted to evaluate the performance of the proposed model on three real-world datasets, i.e., MSCOCO, CUB, and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.
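To make the conditioning idea concrete, the following is a minimal sketch, not the authors' implementation, of how scene-graph entity and relation embeddings could be fused with T5 text embeddings before reaching the diffusion model's cross-attention layers. All names here (SceneGraphEncoder, fuse_conditioning, d_model, the vocabulary sizes) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SceneGraphEncoder(nn.Module):
    """Embeds scene-graph (subject, relation, object) triples as conditioning tokens.

    Hypothetical module for illustration; the paper's actual graph encoder may differ.
    """

    def __init__(self, num_entities: int, num_relations: int, d_model: int = 512):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, d_model)
        self.relation_emb = nn.Embedding(num_relations, d_model)
        # Project each concatenated triple into a single conditioning token.
        self.triple_proj = nn.Linear(3 * d_model, d_model)

    def forward(self, triples: torch.Tensor) -> torch.Tensor:
        # triples: (batch, n_triples, 3) holding [subject, relation, object] ids
        subj = self.entity_emb(triples[..., 0])
        rel = self.relation_emb(triples[..., 1])
        obj = self.entity_emb(triples[..., 2])
        return self.triple_proj(torch.cat([subj, rel, obj], dim=-1))


def fuse_conditioning(text_tokens: torch.Tensor, graph_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate T5 text tokens and scene-graph tokens along the sequence
    axis so the UNet's cross-attention can attend to both."""
    return torch.cat([text_tokens, graph_tokens], dim=1)


# Usage: 4 captions, 77 T5 tokens each, 8 scene-graph triples each (toy data).
encoder = SceneGraphEncoder(num_entities=1000, num_relations=50, d_model=512)
text_tokens = torch.randn(4, 77, 512)                    # stand-in for frozen T5 output
triples = torch.randint(0, 50, (4, 8, 3))                # toy triple ids
cond = fuse_conditioning(text_tokens, encoder(triples))  # (4, 85, 512)
```

Sequence-axis concatenation is one simple fusion choice; it lets the denoising network weight textual and structural cues jointly through a single cross-attention pass.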
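Similarly, the sketch below illustrates the idea behind Swinv2-Unet: replacing a convolutional UNet block with windowed self-attention, which can model dependencies beyond a fixed CNN kernel's receptive field. This is an illustrative simplification under stated assumptions, not the paper's architecture: window shifting, Swin v2's scaled-cosine attention, and diffusion timestep conditioning are all omitted, and the class name is hypothetical.

```python
import torch
import torch.nn as nn


class WindowAttentionBlock(nn.Module):
    """Transformer block that attends within non-overlapping spatial windows
    (a simplified, Swin-style stand-in for a convolutional UNet block)."""

    def __init__(self, dim: int, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map from a UNet stage;
        # height and width are assumed divisible by the window size.
        b, c, h, w = x.shape
        s = self.window
        # Partition the map into (b * num_windows, s*s, c) token sequences.
        tokens = (x.reshape(b, c, h // s, s, w // s, s)
                   .permute(0, 2, 4, 3, 5, 1)
                   .reshape(-1, s * s, c))
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        # Merge the windows back into a feature map of the original shape.
        return (tokens.reshape(b, h // s, w // s, s, s, c)
                      .permute(0, 5, 1, 3, 2, 4)
                      .reshape(b, c, h, w))


# Usage: a 64x64 feature map with 128 channels, as at a mid-resolution UNet stage.
block = WindowAttentionBlock(dim=128, window=8, heads=4)
out = block(torch.randn(2, 128, 64, 64))  # output shape: (2, 128, 64, 64)
```

Restricting attention to local windows keeps the cost linear in image size, which is what makes transformer blocks practical as drop-in replacements for convolutions inside a UNet.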