In recent years, speech processing algorithms have seen tremendous progress, primarily due to the deep learning renaissance. This is especially true for speech separation, where the time-domain audio separation network (TasNet) has led to significant improvements. However, for the related and practically important task of single-speaker speech enhancement, it is not yet known whether the TasNet architecture is equally successful. In this paper, we show that TasNet also improves on the state of the art for speech enhancement, and that the largest gains are achieved for modulated noise sources such as speech. Furthermore, we show that TasNet learns an efficient inner-domain representation in which target and noise signal components are highly separable. This is especially true when the noise consists of interfering speech signals, which might explain why TasNet performs so well on the separation task. Additionally, we show that TasNet performs poorly for large frame hops, and we conjecture that aliasing might be the main cause of this performance drop. Finally, we show that TasNet consistently outperforms a state-of-the-art single-speaker speech enhancement system.