Although conventional mask-based minimum variance distortionless response (MVDR) beamforming can reduce non-linear distortion, the residual noise level of MVDR-separated speech remains high. In this paper, we propose a spatio-temporal recurrent neural network based beamformer (RNN-BF) for target speech separation. This new beamforming framework learns the beamforming weights directly from the estimated speech and noise spatial covariance matrices. Leveraging the temporal modeling capability of RNNs, the RNN-BF automatically accumulates the statistics of the speech and noise covariance matrices and learns frame-level beamforming weights recursively. We propose two variants: an RNN-based generalized eigenvalue (RNN-GEV) beamformer and a more generalized RNN beamformer (GRNN-BF). We further improve both by replacing the commonly used mask normalization of the covariance matrices with layer normalization. The proposed GRNN-BF outperforms prior approaches in terms of speech quality (PESQ), speech-to-noise ratio (SNR), and word error rate (WER).
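For orientation, the conventional mask-based MVDR baseline that the abstract contrasts against can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function names (`masked_covariance`, `mvdr_weights`, `apply_beamformer`), the real-valued mask, the diagonal-loading value, and the synthetic toy data are all assumptions made for this example. It uses the widely used closed-form solution w = (Φ_N⁻¹ Φ_S) u / trace(Φ_N⁻¹ Φ_S), with one weight vector per frequency for the whole utterance. The RNN-BF described above replaces this closed-form step with weights learned frame by frame from the same two covariance matrices.

```python
import numpy as np


def masked_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance, one matrix per frequency bin.

    Y:    complex mixture STFT, shape (channels, frames, freqs)
    mask: real time-frequency mask in [0, 1], shape (frames, freqs)
    Returns an array of shape (freqs, channels, channels).
    """
    # Sum over frames of mask * y y^H for every frequency bin.
    phi = np.einsum("ctf,dtf->fcd", Y * mask[None, :, :], Y.conj())
    # Mask normalization: divide by the accumulated mask energy.
    return phi / (mask.sum(axis=0)[:, None, None] + eps)


def mvdr_weights(phi_s, phi_n, ref=0, diag_load=1e-6, eps=1e-8):
    """Closed-form MVDR weights, shape (freqs, channels).

    w = (Phi_N^-1 Phi_S) u / trace(Phi_N^-1 Phi_S), where u selects
    the reference microphone `ref`.
    """
    n_ch = phi_n.shape[-1]
    # Diagonal loading keeps Phi_N invertible (value is an assumption).
    load = diag_load * np.trace(phi_n, axis1=-2, axis2=-1).real / n_ch
    phi_n = phi_n + load[:, None, None] * np.eye(n_ch)
    numer = np.linalg.solve(phi_n, phi_s)         # Phi_N^-1 Phi_S
    denom = np.trace(numer, axis1=-2, axis2=-1)   # one scalar per frequency
    return numer[:, :, ref] / (denom[:, None] + eps)


def apply_beamformer(w, Y):
    """Apply w^H y at every time-frequency bin; returns (frames, freqs)."""
    return np.einsum("fc,ctf->tf", w.conj(), Y)


# Toy example with synthetic data (not real speech).
rng = np.random.default_rng(0)
C, T, F = 4, 200, 3
steer = rng.standard_normal((C, F)) + 1j * rng.standard_normal((C, F))
src = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
speech = steer[:, None, :] * src[None, :, :]      # speech image at the array
noise = rng.standard_normal((C, T, F)) + 1j * rng.standard_normal((C, T, F))
ones = np.ones((T, F))

# Oracle statistics stand in for mask-based estimates in this toy case.
w = mvdr_weights(masked_covariance(speech, ones), masked_covariance(noise, ones))
enhanced = apply_beamformer(w, speech + noise)
```

In this toy setting the speech covariance is rank one, so the weights pass the reference-channel speech through undistorted while lowering the noise power. In practice the covariances would come from estimated masks, and estimation errors are one source of the residual noise the abstract mentions.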