In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder. The decoder performs accurate speech binauralization while faithfully reconstructing environmental factors such as ambient noise and reverberation. The network is a modified vector-quantized variational autoencoder (VQ-VAE), trained with several carefully designed objectives, including an adversarial loss. We evaluate the proposed system on an internal binaural dataset using objective metrics and a perceptual study. Results show that the proposed approach matches the ground-truth data more closely than previous methods. In particular, we demonstrate that the adversarial loss captures the environmental effects needed to create an authentic auditory scene.
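As an illustration of the vector-quantization step that a VQ-VAE bottleneck relies on, the following minimal sketch maps each latent vector to its nearest codebook entry. It is not the system's actual implementation: the function names `nearest_code` and `quantize`, the toy two-dimensional codebook, and the use of squared Euclidean distance are all assumptions made for the example.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`.

    Closeness is measured by squared Euclidean distance (an assumption
    for this sketch; the abstract does not specify the metric).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def quantize(latents, codebook):
    """Replace every latent vector with its nearest codebook entry.

    Returns the chosen indices (what a codec would transmit) and the
    quantized vectors (what a decoder would receive).
    """
    indices = [nearest_code(v, codebook) for v in latents]
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical toy data: 3 codebook entries, 3 encoder output frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(latents, codebook)
print(indices)
```

Only the integer indices need to be transmitted, which is what makes a codec built on such a bottleneck low-bitrate: each frame costs roughly log2(K) bits for a codebook of K entries.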