Variational autoencoders (VAEs) are a leading approach to the problem of learning disentangled representations. Typically, a single VAE is used, and disentangled representations are sought in its continuous latent space. Here we explore a different approach that uses discrete latents to combine VAE representations of individual sources. The combination is based on an explicit model of source combination; we use a linear combination model, which is well suited, e.g., to acoustic data. We formally define this multi-stream VAE (MS-VAE) approach, derive its inference and learning equations, and numerically investigate its principled functionality. The MS-VAE is domain-agnostic, and we explore its ability to separate sources into distinct streams using superimposed handwritten digits and mixed acoustic sources in a speaker diarization task. We observe a clear separation of digits, and on speaker diarization we observe a notably low rate of missed speakers. Further numerical experiments highlight the flexibility of the approach across varying amounts of supervision and training data.
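To make the linear source-combination idea concrete, the following is a minimal sketch of the generative side of such a multi-stream model: each stream has its own decoder over a continuous latent, a discrete on/off latent gates each stream, and the observation is modeled as the linear superposition of the active streams' outputs. All names (`StreamDecoder`, `LinearCombinationMSVAE`, the layer sizes) are illustrative assumptions, not the paper's actual implementation or inference procedure.

```python
import torch
import torch.nn as nn

class StreamDecoder(nn.Module):
    """One per-source decoder mapping a continuous latent to a reconstruction."""
    def __init__(self, latent_dim: int, data_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, data_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

class LinearCombinationMSVAE(nn.Module):
    """Generative half of a multi-stream model with linear source combination
    (a sketch, not the authors' code)."""
    def __init__(self, n_streams: int, latent_dim: int, data_dim: int):
        super().__init__()
        self.decoders = nn.ModuleList(
            StreamDecoder(latent_dim, data_dim) for _ in range(n_streams)
        )

    def forward(self, z: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # z: (batch, n_streams, latent_dim) continuous latents, one per stream
        # s: (batch, n_streams) discrete on/off indicators (0 or 1)
        streams = torch.stack(
            [dec(z[:, k]) for k, dec in enumerate(self.decoders)], dim=1
        )  # (batch, n_streams, data_dim)
        # Linear combination model: the mixture is the sum of the active streams.
        return (s.unsqueeze(-1) * streams).sum(dim=1)

# Usage: two superimposed sources in a 784-dim space
# (e.g., flattened handwritten-digit images).
model = LinearCombinationMSVAE(n_streams=2, latent_dim=16, data_dim=784)
z = torch.randn(4, 2, 16)
s = torch.tensor([[1., 1.], [1., 0.], [0., 1.], [1., 1.]])
x_mix = model(z, s)  # (4, 784) mixtures of the gated stream reconstructions
```

The same additive structure would apply to mixed acoustic signals, where superposition of waveforms or spectrogram features is a standard modeling assumption; the paper's inference and learning equations over the discrete and continuous latents are not reproduced here.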