In this article we introduce the notion of a Split Variational Autoencoder (SVAE), whose output $\hat{x}$ is obtained as a weighted sum $\sigma \odot \hat{x}_1 + (1-\sigma) \odot \hat{x}_2$ of two generated images $\hat{x}_1,\hat{x}_2$, where $\sigma$ is a learned compositional map. The network is trained like a standard Variational Autoencoder, with a negative log-likelihood loss between training and reconstructed images. The decomposition is nondeterministic, but follows two main schemes, which we may roughly categorize as either "syntactic" or "semantic". In the first case, the map tends to exploit the strong correlation between adjacent pixels, splitting the image into two complementary high-frequency sub-images. In the second case, the map typically focuses on the contours of objects, splitting the image into interesting variations of its content, with more marked and distinctive features. In this case, the Fr\'echet Inception Distance (FID) of $\hat{x}_1$ and $\hat{x}_2$ is usually lower (hence better) than that of $\hat{x}$, which clearly suffers from being the average of the former two. In a sense, an SVAE forces the Variational Autoencoder to {\em make choices}, in contrast with its intrinsic tendency to average between alternatives in order to minimize the reconstruction loss towards a specific sample. According to the FID metric, our technique, tested on typical datasets such as MNIST, CIFAR-10 and CelebA, allows us to outperform all previous purely variational architectures (those not relying on normalizing flows).
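The compositional output described above can be sketched in a few lines. This is a minimal NumPy illustration of the combination step only, assuming $\hat{x}_1$, $\hat{x}_2$ and $\sigma$ are arrays of the same shape produced by the two decoder branches and the (hypothetical) mask branch; it is not the authors' implementation.

```python
import numpy as np

def svae_output(x1_hat, x2_hat, sigma):
    """Combine two generated images with a compositional map.

    sigma is an elementwise weight in [0, 1] (in the paper, a learned
    map); the output is the convex combination of the two candidates.
    """
    return sigma * x1_hat + (1.0 - sigma) * x2_hat

# Toy usage with random "images"; in the real model these would come
# from the two decoder branches of the SVAE.
rng = np.random.default_rng(0)
x1 = rng.random((28, 28))
x2 = rng.random((28, 28))
sigma = rng.random((28, 28))  # stand-in for the learned mask
x_hat = svae_output(x1, x2, sigma)
```

Note that a "syntactic" split corresponds to a near-binary, high-frequency $\sigma$, while a "semantic" split concentrates the mask along object contours.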