Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF masks or the speech spectrum via a naive convolutional neural network or recurrent neural network. Some recent studies are based on the complex spectral mapping convolutional recurrent network (CRN). These models connect the outputs of the encoder layers directly to the inputs of the corresponding decoder layers, which may be suboptimal. We propose an attention-mechanism-based skip connection between the encoder and decoder layers, namely the Complex Spectral Mapping With Attention Based Convolution Recurrent Neural Network (CARN). Compared with the CRN model, the proposed CARN model achieves relative improvements of more than 10% on several metrics, such as PESQ, CBAK, COVL and CSIG, and outperforms the first-place models of both the real-time and non-real-time tracks of the DNS Challenge 2020 on these metrics.
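To make the idea of an attention-based skip connection concrete, the following is a minimal sketch of one plausible form, an additive attention gate applied to encoder features before they are concatenated with decoder features, written in PyTorch. The module name, the gating form, and all layer sizes are illustrative assumptions and are not taken from the CARN paper itself.

```python
# Illustrative sketch (not the authors' exact design): an attention-gated
# skip connection that weights encoder features by a mask computed from the
# decoder state, instead of passing them through unchanged.
import torch
import torch.nn as nn


class AttentionSkip(nn.Module):
    """Gate encoder features with an attention mask before concatenation."""

    def __init__(self, enc_channels: int, dec_channels: int, hidden: int = 16):
        super().__init__()
        self.proj_enc = nn.Conv2d(enc_channels, hidden, kernel_size=1)
        self.proj_dec = nn.Conv2d(dec_channels, hidden, kernel_size=1)
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # Additive attention: one mask value in [0, 1] per time-frequency bin.
        mask = torch.sigmoid(
            self.score(torch.relu(self.proj_enc(enc_feat) + self.proj_dec(dec_feat)))
        )
        gated = enc_feat * mask                     # suppress less relevant encoder features
        return torch.cat([gated, dec_feat], dim=1)  # input to the next decoder layer


# Usage with dummy (batch, channels, time, frequency) tensors:
enc = torch.randn(1, 32, 100, 161)
dec = torch.randn(1, 32, 100, 161)
out = AttentionSkip(32, 32)(enc, dec)  # shape: (1, 64, 100, 161)
```

In contrast, a plain CRN skip connection would simply concatenate `enc_feat` and `dec_feat`; the sketch differs only in the learned mask that re-weights the encoder features.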