Speech restoration aims to remove distortions from speech signals. Prior methods mainly focus on single-task speech restoration (SSR), such as speech enhancement or speech declipping. However, an SSR system handles only one type of distortion and does not address the general speech restoration problem. Previous SSR systems also show limited performance on some restoration tasks, such as speech super-resolution. To overcome these limitations, we propose the general speech restoration (GSR) task, which attempts to remove multiple distortions simultaneously. Furthermore, we propose VoiceFixer, a generative framework that addresses the GSR task. VoiceFixer consists of an analysis stage and a synthesis stage, mimicking the speech analysis and comprehension of the human auditory system. We employ a ResUNet to model the analysis stage and a neural vocoder to model the synthesis stage. We evaluate VoiceFixer on additive noise, room reverberation, low-resolution, and clipping distortions. Our baseline GSR model achieves a mean opinion score (MOS) 0.499 higher than that of the speech enhancement SSR model. VoiceFixer further surpasses the GSR baseline model by 0.256 MOS. In addition, we observe that VoiceFixer generalizes well to severely degraded real speech recordings, indicating its potential for restoring old movies and historical speeches. The source code is available at https://github.com/haoheliu/voicefixer_main.
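To make the four distortion types named in the abstract concrete, the following is a minimal NumPy sketch of how each could be simulated and then stacked on one signal, which is the setting that GSR targets. It is an illustration under stated assumptions, not the degradation pipeline used to train or evaluate VoiceFixer: the function names, the order of the distortions, the default parameters, the 16 kHz rate, and the synthetic signals are all hypothetical, and low resolution is approximated here by a brick-wall low-pass filter rather than by actual resampling.

```python
import numpy as np


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the requested signal-to-noise ratio in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the input length."""
    return np.convolve(speech, rir)[: len(speech)]


def clip(speech, threshold):
    """Hard-clip every sample to the range [-threshold, threshold]."""
    return np.clip(speech, -threshold, threshold)


def reduce_bandwidth(speech, keep_fraction):
    """Zero the upper FFT bins, keeping only the lowest keep_fraction of them.

    Assumption: this brick-wall low-pass stands in for a low sampling rate.
    """
    spectrum = np.fft.rfft(speech)
    cutoff = int(len(spectrum) * keep_fraction)
    spectrum[cutoff:] = 0
    return np.fft.irfft(spectrum, n=len(speech))


def degrade(speech, noise, rir, snr_db=10.0, threshold=0.25, keep_fraction=0.25):
    """Apply all four distortions in sequence (the order is an assumption)."""
    x = add_reverb(speech, rir)
    x = add_noise(x, noise, snr_db)
    x = clip(x, threshold)
    return reduce_bandwidth(x, keep_fraction)


# Synthetic stand-ins for real recordings (hypothetical data).
rng = np.random.default_rng(0)
sample_rate = 16000  # arbitrary choice for this sketch
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 220 * t)  # one second of a 220 Hz tone
noise = rng.standard_normal(sample_rate)  # white noise
rir = rng.standard_normal(2000) * np.exp(-np.arange(2000) / 300.0)  # decaying tail

degraded = degrade(clean, noise, rir)
```

An SSR model would be trained to invert only one of these functions, whereas a GSR model receives the output of `degrade` and must recover `clean` from all four distortions at once.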