Disentanglement Learning for Variational Autoencoders Applied to Audio-Visual Speech Enhancement

Recently, the standard variational autoencoder has been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. Variational autoencoders have then been conditioned on a label describing a high-level speech attribute (e.g. speech activity) that allows for a more explicit control of speech generation. However, the label is not guaranteed to be disentangled from the other latent variables, which results in limited performance improvements compared to the standard variational autoencoder. In this work, we propose to use an adversarial training scheme for variational autoencoders to disentangle the label from the other latent variables. At training, we use a discriminator that competes with the encoder of the variational autoencoder. Simultaneously, we also use an additional encoder that estimates the label for the decoder of the variational autoencoder, which proves to be crucial to learn disentanglement. We show the benefit of the proposed disentanglement learning when a voice activity label, estimated from visual data, is used for speech enhancement.
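The abstract does not give the training objective, so the following is only a sketch of one common way to write an adversarial disentanglement scheme of this kind. All notation is assumed rather than taken from the paper: s is a speech frame, y a binary voice-activity label, z the latent vector, q_phi(z|s) the encoder, p_theta(s|z,y) the decoder, D_psi(z) the discriminator's estimate of P(y=1|z), and lambda > 0 a trade-off weight. The additional encoder that estimates the label for the decoder is left out, because the abstract does not say how it enters the loss.

```latex
% Assumed notation, not from the abstract: see the paragraph above.
\begin{align}
% Standard conditional VAE loss (negative ELBO).
\mathcal{L}_{\mathrm{VAE}}(\theta,\phi)
  &= -\mathbb{E}_{q_\phi(z\mid s)}\big[\log p_\theta(s\mid z,y)\big]
     + \mathrm{KL}\big(q_\phi(z\mid s)\,\|\,p(z)\big) \\
% Discriminator loss: cross-entropy for predicting the label y from z.
\mathcal{L}_{\mathrm{D}}(\psi;\phi)
  &= -\mathbb{E}_{q_\phi(z\mid s)}\big[\,y\log D_\psi(z)
     + (1-y)\log\big(1-D_\psi(z)\big)\big] \\
% The discriminator tries to recover y from z ...
\psi^\star &= \arg\min_{\psi}\ \mathcal{L}_{\mathrm{D}}(\psi;\phi) \\
% ... while the encoder is rewarded when the discriminator fails.
(\theta^\star,\phi^\star)
  &= \arg\min_{\theta,\phi}\ \mathcal{L}_{\mathrm{VAE}}(\theta,\phi)
     - \lambda\,\mathcal{L}_{\mathrm{D}}(\psi;\phi)
\end{align}
```

Under these assumptions the two minimizations are alternated during training. If the discriminator can do no better than chance, z carries little information about y, which is the sense in which the label is disentangled from the other latent variables.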

Related content

Autoencoder

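An autoencoder is a type of artificial neural network used to learn efficient data encodings in an unsupervised manner. Its aim is to learn a representation (encoding) of a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Alongside the reduction side, a reconstruction side is learned, in which the autoencoder tries to generate, from the reduced encoding, an output as close as possible to its original input; this is where the name comes from. Several variants of the basic model exist, each designed to force the learned representation of the input to have useful properties. Autoencoders are used effectively in many applications, from face recognition to capturing the semantic meaning of words.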
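To make the reduce-then-reconstruct idea concrete, here is a minimal sketch in plain Python of a linear autoencoder that compresses noisy 2-D points into a single number and reconstructs them. The toy data, the function names (`make_data`, `encode`, `decode`, `mse`, `train`) and the hyperparameters are all illustrative assumptions; practical autoencoders use nonlinear layers and a deep learning library.

```python
import random


def make_data(n=200, seed=0):
    """Toy data (an assumption): noisy 2-D points close to the line x2 = 2 * x1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.uniform(-1.0, 1.0)
        data.append((t + rng.gauss(0, 0.05), 2.0 * t + rng.gauss(0, 0.05)))
    return data


def encode(enc, x):
    """Reduction side: map a 2-D input to a 1-D code."""
    return enc[0] * x[0] + enc[1] * x[1]


def decode(dec, h):
    """Reconstruction side: map the 1-D code back to 2-D."""
    return (dec[0] * h, dec[1] * h)


def mse(enc, dec, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        xh = decode(dec, encode(enc, x))
        total += (xh[0] - x[0]) ** 2 + (xh[1] - x[1]) ** 2
    return total / len(data)


def train(data, epochs=50, lr=0.02, seed=1):
    """Fit encoder and decoder weights by stochastic gradient descent."""
    rng = random.Random(seed)
    enc = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    dec = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    for _ in range(epochs):
        for x in data:
            h = encode(enc, x)
            xh = decode(dec, h)
            r = (xh[0] - x[0], xh[1] - x[1])   # reconstruction residual
            rd = r[0] * dec[0] + r[1] * dec[1]  # residual projected on decoder
            for i in range(2):
                # Gradients of the squared error ||xh - x||^2 for this sample.
                g_dec = 2.0 * r[i] * h
                g_enc = 2.0 * rd * x[i]
                dec[i] -= lr * g_dec
                enc[i] -= lr * g_enc
    return enc, dec


if __name__ == "__main__":
    data = make_data()
    enc, dec = train(data)
    print("reconstruction error after training:", mse(enc, dec, data))
```

Because the code has only one dimension, the network can keep only the direction along which the points vary most. The small scatter around the line is the "noise" it learns to ignore.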