Synthetic data generation is increasingly important due to privacy concerns. While Autoencoder (AE)-based approaches have been widely used for this purpose, sampling from their latent spaces can be challenging. Mixture models are currently the most efficient way to sample from these spaces. In this work, we propose a new approach that models the latent space of an Autoencoder as a simplex, allowing for a novel heuristic for determining the number of components in the mixture model. This heuristic is independent of the number of classes and produces comparable results. We also introduce a sampling method based on probability mass functions, taking advantage of the compactness of the latent space. We evaluate our approaches on a synthetic dataset and demonstrate their performance on three benchmark datasets: MNIST, CIFAR-10, and CelebA. Our approach achieves image generation FIDs of 4.29, 13.55, and 11.90 on MNIST, CIFAR-10, and CelebA, respectively. The best AE FID results to date on those datasets are 6.3, 85.3, and 35.6, respectively; we hence substantially improve on those figures (the lower the FID, the better). However, AEs are not the best-performing algorithms on these datasets, and all FID records are currently held by GANs. While we do not outperform GANs on CIFAR-10 and CelebA, we do manage to squeeze out a non-negligible improvement (of 0.21) over the current GAN-held record on MNIST.
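To make the idea of sampling from a compact latent space with a probability mass function more concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes that latent codes lie in the unit hypercube [0, 1]^d (which contains the simplex), discretizes that space into a regular grid, estimates a PMF over the occupied grid cells, and draws new codes uniformly inside sampled cells. The function names `fit_pmf` and `sample_latents` and the `bins` parameter are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def fit_pmf(latents, bins=10):
    """Estimate a PMF over the occupied grid cells of [0, 1]^d.

    Assumption: every coordinate of every latent code is in [0, 1].
    """
    counts = Counter(
        # Map each coordinate to a bin index; x == 1.0 goes to the last bin.
        tuple(min(int(x * bins), bins - 1) for x in z)
        for z in latents
    )
    total = sum(counts.values())
    cells = list(counts)
    probs = [counts[c] / total for c in cells]
    return cells, probs


def sample_latents(cells, probs, n, bins=10, rng=None):
    """Draw n latent codes: pick a cell from the PMF, then a point inside it."""
    rng = rng or random.Random()
    chosen = rng.choices(cells, weights=probs, k=n)
    # Uniform jitter inside the chosen cell along every dimension.
    return [[(i + rng.random()) / bins for i in cell] for cell in chosen]


# Toy usage with hand-written 2-D latent codes (illustrative data only).
toy_latents = [[0.05, 0.95], [0.07, 0.91], [0.52, 0.48]]
cells, probs = fit_pmf(toy_latents, bins=10)
samples = sample_latents(cells, probs, n=5, bins=10, rng=random.Random(0))
```

Only occupied cells are stored, so memory grows with the number of distinct cells rather than with `bins ** d`. In practice the sampled codes would then be passed through the decoder; a sample drawn this way is not guaranteed to lie exactly on the simplex, so a projection or renormalization step may be needed under the simplex assumption.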