We study the use of the Wave-U-Net architecture for speech enhancement, a model introduced by Stoller et al. for the separation of music vocals and accompaniment. This end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and the use of large temporal contexts. Our experiments show that the proposed method improves over the state of the art on several metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, for the speech enhancement task on the Voice Bank corpus (VCTK). We find that fewer hidden layers than in the original system, which was designed for singing-voice separation in music, are sufficient for speech enhancement. We see this initial result as an encouraging signal to further explore speech enhancement in the time domain, both as an end in itself and as a pre-processing step for speech recognition systems.
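To make the architecture concrete, the following is a minimal sketch of a Wave-U-Net-style model in PyTorch. The class name `WaveUNetSketch` and all hyperparameters (six layers, filter growth of 24 per layer, kernel sizes 15 and 5) are illustrative assumptions in the spirit of the original design, not the exact configuration evaluated here; six layers merely echoes the abstract's point that fewer layers than the original twelve can suffice. The sketch shows the core idea: downsampling blocks alternate convolution and decimation, upsampling blocks combine linear interpolation with skip connections, and the model maps a noisy waveform directly to an enhanced one in the time domain.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WaveUNetSketch(nn.Module):
    """A minimal Wave-U-Net-style sketch: a 1D U-Net on raw waveforms.

    Layer count, filter growth and kernel sizes follow the general
    Wave-U-Net recipe but are illustrative, not the paper's exact setup.
    """

    def __init__(self, num_layers: int = 6, filter_growth: int = 24):
        super().__init__()
        self.down_convs = nn.ModuleList()
        in_ch = 1  # mono waveform in
        for i in range(1, num_layers + 1):
            self.down_convs.append(
                nn.Conv1d(in_ch, filter_growth * i, kernel_size=15, padding=7))
            in_ch = filter_growth * i
        self.bottleneck = nn.Conv1d(in_ch, in_ch, kernel_size=15, padding=7)
        self.up_convs = nn.ModuleList()
        cur = in_ch
        for i in range(num_layers, 0, -1):
            # input: upsampled features concatenated with the level-i skip
            self.up_convs.append(
                nn.Conv1d(cur + filter_growth * i, filter_growth * i,
                          kernel_size=5, padding=2))
            cur = filter_growth * i
        self.out_conv = nn.Conv1d(cur, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time); time must be divisible by 2 ** num_layers
        skips = []
        for conv in self.down_convs:
            x = F.leaky_relu(conv(x))
            skips.append(x)   # keep features for the skip connection
            x = x[:, :, ::2]  # downsample: discard every other sample
        x = F.leaky_relu(self.bottleneck(x))
        for conv in self.up_convs:
            # linear upsampling by 2, then concatenate the matching skip
            x = F.interpolate(x, scale_factor=2, mode="linear",
                              align_corners=False)
            x = torch.cat([x, skips.pop()], dim=1)
            x = F.leaky_relu(conv(x))
        return torch.tanh(self.out_conv(x))  # enhanced waveform in [-1, 1]


# Usage: enhance a one-second 16 kHz mono clip (random here for illustration).
model = WaveUNetSketch(num_layers=6)
noisy = torch.randn(1, 1, 16384)
clean_estimate = model(noisy)
print(clean_estimate.shape)  # torch.Size([1, 1, 16384])
```

Operating on raw samples rather than spectrogram magnitudes is what lets such a model account for phase implicitly, at the cost of needing large receptive fields, which the repeated decimation provides.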