Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence: the semi-recognized speech content, prosody features, and speaker identity. We propose a wav-to-wav generative model that can generate 24 kHz speech in real time and that utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method achieves notable improvements over recent baselines.
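To make the described pipeline concrete, the following is a minimal sketch, assuming a PyTorch-style formulation: a content (ASR-feature) stream and a speaker embedding from an auxiliary identity network jointly condition a decoder that upsamples frame-rate features to a 24 kHz waveform. All module names, layer sizes, and upsampling factors are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch of a regeneration-style wav-to-wav pipeline:
# compact representation = ASR-derived content features + identity embedding,
# decoded back into a 24 kHz waveform. Shapes and layers are assumptions.
import torch
import torch.nn as nn


class IdentityEncoder(nn.Module):
    """Auxiliary identity network: maps a reference waveform to a speaker embedding."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> one vector per utterance
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) -> (batch, emb_dim)
        return self.net(wav).squeeze(-1)


class RegenerationDecoder(nn.Module):
    """Generates a waveform from content (ASR) features conditioned on identity."""
    def __init__(self, content_dim: int = 512, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Conv1d(content_dim + emb_dim, 256, kernel_size=1)
        # Transposed convolutions upsample frame-rate features (x320 total)
        # toward the target 24 kHz sample rate.
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=15, stride=5, padding=5), nn.Tanh(),
        )

    def forward(self, content: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        # content: (batch, content_dim, frames); identity: (batch, emb_dim)
        identity = identity.unsqueeze(-1).expand(-1, -1, content.size(-1))
        x = self.proj(torch.cat([content, identity], dim=1))
        return self.upsample(x)  # (batch, 1, samples)


if __name__ == "__main__":
    content = torch.randn(2, 512, 100)   # stand-in for ASR encoder features
    ref_wav = torch.randn(2, 1, 24000)   # reference audio for the identity network
    wav = RegenerationDecoder()(content, IdentityEncoder()(ref_wav))
    print(wav.shape)  # (2, 1, 32000): 100 frames upsampled by a factor of 320
```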