Speech enhancement in the time domain has become increasingly popular in recent years because it can jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Together, dense blocks and attention modules improve feature extraction through a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and the predicted noise. Even though the proposed loss is based on magnitudes only, a constraint imposed by noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
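To illustrate why a loss on the magnitude of enhanced speech alone can leave phase unconstrained, and how adding a noise term can help, the following is a minimal NumPy sketch. It is not the exact loss from this work. It assumes that the predicted noise is the residual between the noisy input and the enhanced speech, that the two magnitude terms are weighted equally, and that magnitudes are compared with a mean absolute error; the function names, frame length, and hop size are hypothetical choices made for the example.

```python
import numpy as np


def stft_mag(x, frame_len=512, hop=256):
    """Magnitude STFT of a 1-D signal (frames x frequency bins), Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))


def magnitude_loss(ref, est):
    """Mean absolute error between the STFT magnitudes of two signals."""
    return float(np.mean(np.abs(stft_mag(ref) - stft_mag(est))))


def speech_plus_noise_magnitude_loss(noisy, clean, enhanced):
    """Average of the speech-magnitude and noise-magnitude losses.

    Assumption: the predicted noise is the residual noisy - enhanced,
    and both terms are weighted by 0.5.
    """
    true_noise = noisy - clean
    pred_noise = noisy - enhanced
    return 0.5 * magnitude_loss(clean, enhanced) + 0.5 * magnitude_loss(
        true_noise, pred_noise
    )


# Toy signals: white noise stands in for speech and for the interference.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
noisy = clean + 0.3 * rng.standard_normal(4096)

# A sign-flipped estimate has the correct magnitude but the wrong phase.
flipped = -clean
print("speech-magnitude loss:", magnitude_loss(clean, flipped))
print("speech+noise loss:    ", speech_plus_noise_magnitude_loss(noisy, clean, flipped))
```

In this toy case the sign-flipped estimate is invisible to the speech-magnitude loss, because negating a signal leaves its STFT magnitude unchanged. The residual noisy - enhanced, however, no longer matches the true noise, so the combined loss penalizes the phase error even though it is computed from magnitudes only.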