Hand-crafted spatial features, such as the inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep-learning-based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between such artificially designed spatial features and spectral features is difficult in end-to-end DMSE. In this work, a novel architecture for DMSE using a multi-head cross-attention based convolutional recurrent network (MHCA-CRN) is presented. The proposed MHCA-CRN model includes a channel-wise encoding structure for preserving intra-channel features and a multi-head cross-attention mechanism for fully exploiting cross-channel features. In addition, the proposed approach formulates the decoder with an extra SNR estimator that predicts frame-level SNR under a multi-task learning framework, which is expected to avoid the speech distortion introduced by the end-to-end DMSE module. Finally, a spectral gain function is adopted to further suppress unnatural residual noise. Experimental results demonstrate the superior performance of the proposed model over several state-of-the-art models.
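To make the cross-channel fusion idea concrete, below is a minimal sketch of a multi-head cross-attention block operating on two channel-wise encoder streams, assuming PyTorch. The class name `CrossChannelAttention`, the feature dimension, and the head count are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class CrossChannelAttention(nn.Module):
    """Illustrative multi-head cross-attention fusing two microphone-channel streams.

    Queries come from one channel's encoder features and keys/values from the
    other channel, so each channel attends to cross-channel cues (an analogue
    of the spatial information carried by hand-crafted IID/IPD features).
    """

    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, ch_a: torch.Tensor, ch_b: torch.Tensor) -> torch.Tensor:
        # ch_a, ch_b: (batch, frames, feat_dim) channel-wise encoder outputs
        fused, _ = self.attn(query=ch_a, key=ch_b, value=ch_b)
        # Residual connection keeps the intra-channel features of ch_a intact
        return self.norm(ch_a + fused)


# Toy usage with random features standing in for the two microphone channels
left = torch.randn(2, 100, 256)
right = torch.randn(2, 100, 256)
fused_left = CrossChannelAttention()(left, right)
print(fused_left.shape)  # torch.Size([2, 100, 256])
```

In practice such a block would sit between the channel-wise encoders and the decoder, with a symmetric counterpart attending from the right channel to the left.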