Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative speech synthesis, where the system's output is synthesized by a neural vocoder after an inherently lossy feature-denoising step. In this paper, we propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech. We leverage rich representations from self-supervised learning (SSL) speech models to discover relevant features. We conduct a candidate search across 15 potential SSL front-ends and subsequently train our vocoder adversarially with the best SSL configuration. Additionally, we demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation. Finally, we conduct both objective evaluations and subjective listening studies to show our system improves objective metrics and outperforms an existing state-of-the-art SE model subjectively.