Recently, the standard variational autoencoder has been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. Variational autoencoders have since been conditioned on a label describing a high-level speech attribute (e.g., speech activity), which allows more explicit control of speech generation. However, the label is not guaranteed to be disentangled from the other latent variables, which results in limited performance improvements over the standard variational autoencoder. In this work, we propose an adversarial training scheme for variational autoencoders that disentangles the label from the other latent variables. During training, we use a discriminator that competes with the encoder of the variational autoencoder. Simultaneously, we use an additional encoder that estimates the label for the decoder of the variational autoencoder, which proves crucial for learning disentanglement. We show the benefit of the proposed disentanglement learning when a voice activity label, estimated from visual data, is used for speech enhancement.
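As an illustration only, the competing objectives described above can be sketched in plain Python. The abstract does not state the loss functions, so everything here is an assumption: a binary label, a Gaussian latent with a standard normal prior, a binary cross-entropy adversarial term, and the hypothetical names `discriminator_loss`, `label_encoder_loss`, `encoder_decoder_loss`, and `adv_weight`. It shows one common way such a scheme can be written, not the authors' implementation.

```python
import math


def bce(p, y):
    """Binary cross-entropy between a predicted probability p and a label y in {0, 1}."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def discriminator_loss(p_label_from_z, y):
    # Assumed setup: the discriminator tries to recover the label y from the latent code z.
    return bce(p_label_from_z, y)


def label_encoder_loss(p_label_from_x, y):
    # Assumed setup: the additional encoder estimates the label from the input,
    # and its estimate is what the decoder is conditioned on.
    return bce(p_label_from_x, y)


def encoder_decoder_loss(recon_error, mu, logvar, p_label_from_z, y, adv_weight=1.0):
    # Negative ELBO minus the weighted discriminator loss: the encoder is rewarded
    # when the discriminator cannot predict the label from z.
    return recon_error + gaussian_kl(mu, logvar) - adv_weight * bce(p_label_from_z, y)


# Toy values (made up): the encoder's loss is lower when the discriminator is at chance.
fooled = encoder_decoder_loss(1.0, [0.0], [0.0], p_label_from_z=0.5, y=1)
leaky = encoder_decoder_loss(1.0, [0.0], [0.0], p_label_from_z=0.9, y=1)
print(fooled < leaky)
```

In such a setup the discriminator and the encoder would be updated in alternation, each minimizing its own loss; `adv_weight` trades reconstruction quality against how strongly label information is pushed out of the latent code.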