In this work, we propose a new generative model that is capable of automatically decoupling global and local representations of images in an entirely unsupervised setting, by embedding a generative flow in the VAE framework to model the decoder. Specifically, the proposed model utilizes the variational auto-encoding framework to learn a (low-dimensional) vector of latent variables that captures the global information of an image, which is fed as a conditional input to a flow-based invertible decoder whose architecture is borrowed from the style transfer literature. Experimental results on standard image benchmarks demonstrate the effectiveness of our model in terms of density estimation, image generation, and unsupervised representation learning. Importantly, this work demonstrates that with only architectural inductive biases, a generative model with a likelihood-based objective is capable of learning decoupled representations without any explicit supervision. The code for our model is available at https://github.com/XuezheMax/wolf.
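To make the described architecture concrete, the following is a minimal sketch, not the authors' implementation: a VAE whose decoder is a conditional generative flow, with the global latent vector z modulating a single affine-coupling step. All module names, dimensions, and the single-step flow are illustrative assumptions; the actual model in the wolf repository stacks many such steps with style-transfer-inspired conditioning.

```python
# Minimal sketch (assumptions noted above), in PyTorch.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Amortized posterior q(z|x): maps an image to mean/log-variance of z."""
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)

    def forward(self, x):
        h = self.net(x.flatten(1))
        return self.mu(h), self.logvar(h)

class ConditionalCoupling(nn.Module):
    """One affine coupling step whose scale/shift networks are conditioned
    on z, so the global latent modulates the invertible decoder."""
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.half = x_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + z_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * (x_dim - self.half)))

    def forward(self, x, z):  # x -> u, also returns log|det Jacobian|
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([xa, z], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)  # keep scales numerically well-behaved
        ub = xb * torch.exp(log_s) + t
        return torch.cat([xa, ub], dim=1), log_s.sum(dim=1)

    def inverse(self, u, z):  # u -> x, used for generation
        ua, ub = u[:, :self.half], u[:, self.half:]
        log_s, t = self.net(torch.cat([ua, z], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        xb = (ub - t) * torch.exp(-log_s)
        return torch.cat([ua, xb], dim=1)

def elbo(x, encoder, flow):
    """Likelihood-based objective: E_q[log p(x|z)] - KL(q(z|x) || N(0, I)),
    where log p(x|z) is computed exactly via the flow's change of variables."""
    mu, logvar = encoder(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
    u, logdet = flow(x.flatten(1), z)
    base = torch.distributions.Normal(0.0, 1.0)
    log_px_given_z = base.log_prob(u).sum(dim=1) + logdet
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (log_px_given_z - kl).mean()

if __name__ == "__main__":
    x_dim, z_dim = 784, 64  # e.g. flattened 28x28 images; sizes are assumed
    enc, flow = Encoder(x_dim, z_dim), ConditionalCoupling(x_dim, z_dim)
    x = torch.rand(8, 1, 28, 28)
    loss = -elbo(x, enc, flow)  # maximize the ELBO
    loss.backward()
    print(f"negative ELBO: {loss.item():.2f}")
```

In this sketch, generation would invert the flow: sample z from the prior and u from a standard Gaussian, then apply flow.inverse(u, z). Holding u fixed while varying z changes only the global factor captured by the latent vector, which is the global/local decoupling behavior described above.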