Speech restoration aims to remove distortions from speech signals. Prior methods mainly focus on single-task speech restoration (SSR), such as speech denoising or speech declipping. However, each SSR system handles only one task and does not address the general speech restoration problem. In addition, previous SSR systems show limited performance on some speech restoration tasks, such as speech super-resolution. To overcome these limitations, we propose a general speech restoration (GSR) task that attempts to remove multiple distortions simultaneously. Furthermore, we propose VoiceFixer, a generative framework to address the GSR task. VoiceFixer consists of an analysis stage and a synthesis stage to mimic the speech analysis and comprehension of the human auditory system. We employ a ResUNet to model the analysis stage and a neural vocoder to model the synthesis stage. We evaluate VoiceFixer with additive noise, room reverberation, low resolution, and clipping distortions. Our baseline GSR model achieves a mean opinion score (MOS) 0.499 higher than that of the speech enhancement SSR model. VoiceFixer further surpasses the GSR baseline model by 0.256 MOS. Moreover, we observe that VoiceFixer generalizes well to severely degraded real speech recordings, indicating its potential for restoring old movies and historical speeches. The source code is available at https://github.com/haoheliu/voicefixer_main.
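To make the four distortion types named in the abstract concrete, the following is a minimal sketch of how each might be simulated on a clean waveform using only NumPy. It is not the authors' data pipeline: the function names, the synthetic impulse response, the FFT-based low-pass filter standing in for low resolution, the order in which the distortions are applied, and all parameter values (`rt60`, `cutoff_hz`, `threshold`, `snr_db`) are assumptions chosen for illustration.

```python
import numpy as np


def add_noise(speech, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR in dB."""
    noise = rng.standard_normal(speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, sample_rate, rt60, rng):
    """Convolve with a synthetic impulse response (decaying noise).

    Assumption: a real pipeline would likely use measured or simulated
    room impulse responses instead of this toy model.
    """
    n = max(1, int(round(rt60 * sample_rate)))
    t = np.arange(n) / sample_rate
    # Amplitude envelope falls by 60 dB over rt60 seconds.
    ir = rng.standard_normal(n) * 10 ** (-3 * t / rt60)
    ir[0] = 1.0  # direct path
    ir /= np.sqrt(np.sum(ir ** 2))  # unit-energy impulse response
    return np.convolve(speech, ir)[: len(speech)]


def clip(speech, threshold):
    """Hard-clip the waveform to [-threshold, threshold]."""
    return np.clip(speech, -threshold, threshold)


def low_pass(speech, sample_rate, cutoff_hz):
    """Discard content above cutoff_hz, mimicking a low-resolution recording."""
    spectrum = np.fft.rfft(speech)
    freqs = np.fft.rfftfreq(len(speech), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(speech))


def degrade(speech, sample_rate, seed=0):
    """Apply all four distortions in one (hypothetical) order."""
    rng = np.random.default_rng(seed)
    x = add_reverb(speech, sample_rate, rt60=0.3, rng=rng)
    x = low_pass(x, sample_rate, cutoff_hz=4000)
    x = clip(x, threshold=0.25)
    x = add_noise(x, snr_db=10, rng=rng)
    return x
```

Under these assumptions, a GSR model would be trained to map `degrade(speech, sample_rate)` back to `speech`, whereas an SSR model would see only one of the four functions applied at a time.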