In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lip movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) as well as the visual data. The visual modality also serves as a prior for the latent variables, through a visual network. At test time, the learned generative model (in both speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for background noise. All the latent variables and noise parameters are then estimated by a Monte Carlo expectation-maximization algorithm. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches as well as a supervised deep learning-based technique.
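To make the training-time model concrete, below is a minimal sketch (not the authors' code) of an audio-visual VAE along the lines described above, assuming per-frame processing: the encoder infers the latent posterior from a mixed-speech power-spectrogram frame concatenated with a visual (lip) embedding, a separate visual network provides the latent prior, and the decoder outputs the clean-speech variance. All layer sizes, dimensions, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AudioVisualVAE(nn.Module):
    """Hypothetical per-frame audio-visual VAE sketch (dimensions are assumptions)."""

    def __init__(self, n_freq=257, n_visual=64, n_latent=32, n_hidden=128):
        super().__init__()
        # Encoder q(z | mixture frame, visual embedding)
        self.enc = nn.Linear(n_freq + n_visual, n_hidden)
        self.enc_mean = nn.Linear(n_hidden, n_latent)
        self.enc_logvar = nn.Linear(n_hidden, n_latent)
        # Visual prior network p(z | visual embedding)
        self.prior = nn.Linear(n_visual, n_hidden)
        self.prior_mean = nn.Linear(n_hidden, n_latent)
        self.prior_logvar = nn.Linear(n_hidden, n_latent)
        # Decoder p(s | z): log-variance of the clean-speech spectrogram frame
        self.dec = nn.Linear(n_latent, n_hidden)
        self.dec_logvar = nn.Linear(n_hidden, n_freq)

    def encode(self, x_mix, v):
        h = torch.tanh(self.enc(torch.cat([x_mix, v], dim=-1)))
        return self.enc_mean(h), self.enc_logvar(h)

    def prior_params(self, v):
        h = torch.tanh(self.prior(v))
        return self.prior_mean(h), self.prior_logvar(h)

    def decode(self, z):
        h = torch.tanh(self.dec(z))
        return self.dec_logvar(h)

    def loss(self, x_mix, x_clean, v):
        # Posterior inferred from the mixture and the visual data (reparameterization trick)
        mu_q, logvar_q = self.encode(x_mix, v)
        z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)
        # Gaussian likelihood with zero mean and decoded variance on the clean power spectrogram
        logvar_s = self.decode(z)
        recon = (x_clean / logvar_s.exp() + logvar_s).sum(dim=-1).mean()
        # KL divergence between the audio-visual posterior and the visual prior
        mu_p, logvar_p = self.prior_params(v)
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1.0).sum(dim=-1).mean()
        return recon + kl
```

In this sketch, the KL term pulls the mixture-conditioned posterior toward the visual prior, which is one way the visual modality can act as a prior on the latent variables; the test-time NMF noise model and Monte Carlo EM estimation described in the abstract are not shown here.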