Most recent successful image synthesis models are multi-stage pipelines that combine the advantages of different methods, typically comprising a VAE-like model that faithfully reconstructs an image from its embedding and a prior model that generates the image embedding. Meanwhile, diffusion models have demonstrated the capacity to generate high-quality synthetic images. Our work proposes a VQ-VAE architecture with a diffusion decoder (DiVAE) to serve as the reconstructing component in image synthesis. We explore how to feed the image embedding into the diffusion model for strong performance and find that a simple modification of the diffusion UNet suffices. Trained on ImageNet, our model achieves state-of-the-art results and, in particular, generates more photorealistic images. In addition, we pair DiVAE with an auto-regressive generator on conditional synthesis tasks to produce more natural-looking and detailed samples.
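One plausible reading of the "simple modification" to the diffusion UNet is to upsample the discrete embedding grid to the image resolution and concatenate it to the noisy input along the channel axis, so the first convolution simply takes extra channels. The NumPy sketch below illustrates this conditioning idea only; the function name and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def condition_unet_input(noisy_image, vq_embedding):
    """Hypothetical sketch: condition a diffusion UNet on a VQ image
    embedding by nearest-neighbour upsampling the embedding grid to the
    image resolution and concatenating it to the noisy input along the
    channel axis. The UNet's first conv would then take c + ec channels."""
    c, h, w = noisy_image.shape          # image channels, height, width
    ec, eh, ew = vq_embedding.shape      # embedding channels and grid size
    # nearest-neighbour upsample the embedding grid to the image size
    up = vq_embedding.repeat(h // eh, axis=1).repeat(w // ew, axis=2)
    # stack along channels: the only change to the UNet is its input width
    return np.concatenate([noisy_image, up], axis=0)

noisy = np.random.randn(3, 64, 64)   # noisy RGB image at step t
emb = np.random.randn(8, 16, 16)     # assumed 16x16 grid of 8-dim codes
x = condition_unet_input(noisy, emb)
print(x.shape)  # (11, 64, 64)
```

With this scheme the rest of the UNet is unchanged, which matches the abstract's claim that only a minor architectural change is needed.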