Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on modelling the posterior distribution over the latent space and on the properties of the neural network decoder. In contrast, the model for the observational distribution is rarely reconsidered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially incoherent results with uncorrelated pixel noise, so that only the sample mean is somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples drawn from the observational distribution. We propose an alternative model for the observation space that encodes spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution can capture relevant covariance between pixels, resulting in spatially coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically meaningful variations from the mean, allowing the prediction of multiple plausible outputs with a single forward pass.
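The core technical idea, replacing the pixel-wise independent observation model with a low-rank-plus-diagonal Gaussian over pixels, can be sketched in a few lines. The following is a minimal illustration under assumed details, not the authors' implementation: it supposes a PyTorch decoder whose feature vector `h` is mapped to a mean image, a per-pixel diagonal variance, and a small number of covariance factors. All names (`LowRankObservationHead`, `feat_dim`, `rank`, etc.) are hypothetical.

```python
import torch
import torch.nn as nn
from torch.distributions import LowRankMultivariateNormal


class LowRankObservationHead(nn.Module):
    """Maps decoder features to a Gaussian over pixels with covariance
    Sigma = F F^T + diag(d), where F has `rank` columns. Sampling and
    log-probability remain tractable even when the image is large."""

    def __init__(self, feat_dim: int, num_pixels: int, rank: int = 10):
        super().__init__()
        self.num_pixels = num_pixels
        self.rank = rank
        self.mean_head = nn.Linear(feat_dim, num_pixels)
        self.log_diag_head = nn.Linear(feat_dim, num_pixels)
        self.factor_head = nn.Linear(feat_dim, num_pixels * rank)

    def forward(self, h: torch.Tensor) -> LowRankMultivariateNormal:
        mu = self.mean_head(h)                                # (B, P) mean image
        diag = torch.exp(self.log_diag_head(h)) + 1e-5        # (B, P) positive variances
        factors = self.factor_head(h).view(-1, self.num_pixels, self.rank)  # (B, P, K)
        return LowRankMultivariateNormal(mu, cov_factor=factors, cov_diag=diag)


# Usage sketch: one forward pass yields a distribution whose samples carry
# spatially correlated (rather than independent per-pixel) noise.
head = LowRankObservationHead(feat_dim=256, num_pixels=32 * 32, rank=10)
h = torch.randn(4, 256)                       # decoder features for a batch of 4
p_x_given_z = head(h)
samples = p_x_given_z.rsample()               # (4, 1024), spatially coherent
target = torch.randn(4, 32 * 32)              # placeholder for real training images
nll = -p_x_given_z.log_prob(target)           # reconstruction term of the ELBO
```

Drawing several samples from `p_x_given_z` is how a single forward pass can yield multiple plausible output predictions, in contrast to a pixel-wise independent model whose samples differ from the mean only by white noise.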