We present fast, realistic image generation on high-resolution, multimodal datasets using hierarchical variational autoencoders (VAEs) trained on a deterministic autoencoder's latent space. In this two-stage setup, the autoencoder compresses the image into its semantic features, which are then modeled with a deep VAE. With this method, the VAE avoids modeling the fine-grained details that constitute the majority of the image's code length, allowing it to focus on learning the image's structural components. We demonstrate the effectiveness of our two-stage approach, achieving an FID of 9.34 on ImageNet-256, comparable to BigGAN. We make our implementation available online.
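For concreteness, the following is a minimal PyTorch sketch of the two-stage setup, under assumptions not taken from the paper: the layer sizes, the 64x64 dummy input, the single-level latent VAE (standing in for the deep hierarchical VAE), and the KL weight are all illustrative placeholders rather than the actual architecture or training recipe.

```python
# Minimal sketch of the two-stage setup: (1) train a deterministic autoencoder
# on images, (2) freeze it and fit a VAE on its latent codes.
# All sizes and the single-level VAE are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeterministicAE(nn.Module):
    # Stage 1: compress an image into a compact latent code and reconstruct it.
    def __init__(self, img_channels=3, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, latent_dim, 4, 2, 1),           # (B, latent_dim, H/8, W/8)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

class LatentVAE(nn.Module):
    # Stage 2: model the distribution of the frozen autoencoder's latents.
    # A single latent level stands in here for the paper's hierarchy.
    def __init__(self, latent_dim=256, z_dim=64):
        super().__init__()
        self.enc = nn.Conv2d(latent_dim, 2 * z_dim, 3, padding=1)   # -> mu, logvar
        self.dec = nn.Conv2d(z_dim, latent_dim, 3, padding=1)

    def forward(self, c):
        mu, logvar = self.enc(c).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization
        recon = self.dec(z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=[1, 2, 3]).mean()
        return recon, kl

# Stage 1: train the deterministic autoencoder with a reconstruction loss.
ae = DeterministicAE()
x = torch.randn(8, 3, 64, 64)                   # dummy batch; 64x64 for illustration
recon_x, codes = ae(x)
ae_loss = F.mse_loss(recon_x, x)

# Stage 2: freeze the autoencoder and fit the VAE on its latent codes,
# so the VAE never has to model pixel-level detail.
for p in ae.parameters():
    p.requires_grad_(False)
vae = LatentVAE()
recon_c, kl = vae(codes.detach())
vae_loss = F.mse_loss(recon_c, codes.detach()) + 1e-3 * kl   # KL weight is a placeholder
```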