Recently, multi-channel speech enhancement has drawn much interest because spatial information can be used to distinguish target speech from interfering signals. To make full use of spatial information and neural-network-based mask estimation, we propose a multi-channel denoising neural network, Spatial-DCCRN. First, we extend S-DCCRN to the multi-channel scenario with a cascaded sub-channel and full-channel processing strategy, which can model different channels separately. Moreover, instead of only adopting the multi-channel spectrum, or concatenating the first channel's magnitude and the inter-channel phase difference (IPD), as the model's input, we apply an angle feature extraction (AFE) module to extract frame-level angle feature embeddings, which help the model perceive spatial information explicitly. Finally, since residual noise is more severe when noise and speech occupy the same time-frequency (TF) bin, we design a masking and mapping filtering method to replace the traditional filter-and-sum operation, with the purpose of cascading coarse denoising, dereverberation and residual noise suppression. The proposed model, Spatial-DCCRN, surpasses EaBNet, FasNet and several other competitive models on the L3DAS22 Challenge dataset. Beyond the 3D scenario, Spatial-DCCRN also outperforms the state-of-the-art (SOTA) model MIMO-UNet by a large margin on multiple evaluation metrics on the multi-channel ConferencingSpeech2021 Challenge dataset. Ablation studies also demonstrate the effectiveness of each contribution.
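For reference, the conventional input that the abstract contrasts with AFE, namely the first channel's magnitude concatenated with IPD, can be sketched as follows. This is a minimal NumPy illustration and not the authors' code: the function name `magnitude_ipd_features`, the `(channels, frames, freq_bins)` layout, and the cos/sin encoding of the phase difference are all assumptions made here for clarity.

```python
import numpy as np

def magnitude_ipd_features(spec, ref=0):
    """Build a baseline spatial input from a complex multi-channel STFT.

    spec: complex array of shape (channels, frames, freq_bins) (assumed layout).
    ref:  index of the reference channel (hypothetical parameter).
    Returns a real array of shape (1 + 2 * (channels - 1), frames, freq_bins).
    """
    # Magnitude of the reference (first) channel.
    mag = np.abs(spec[ref])
    # All non-reference channels.
    others = np.delete(spec, ref, axis=0)
    # Phase of each channel relative to the reference, wrapped to (-pi, pi].
    ipd = np.angle(others * np.conj(spec[ref]))
    # Encode the IPD as cos/sin so the wrap-around at +/-pi is continuous
    # (an encoding choice assumed here, not stated in the abstract).
    return np.concatenate([mag[None], np.cos(ipd), np.sin(ipd)], axis=0)

# Usage on a random 4-channel spectrogram (10 frames, 257 frequency bins).
rng = np.random.default_rng(0)
spec = rng.standard_normal((4, 10, 257)) + 1j * rng.standard_normal((4, 10, 257))
feats = magnitude_ipd_features(spec)
```

Features of this kind give the network spatial cues only implicitly, through per-bin phase differences; the abstract's AFE module is proposed as a way to supply frame-level angle embeddings instead.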