Many purely neural-network-based speech separation approaches have been proposed to improve objective assessment scores, but they often introduce nonlinear distortions that are harmful to modern automatic speech recognition (ASR) systems. Minimum variance distortionless response (MVDR) filters are often adopted to remove these nonlinear distortions; however, conventional neural mask-based MVDR systems still leave relatively high levels of residual noise. Moreover, the matrix inverse involved in the MVDR solution is sometimes numerically unstable during joint training with neural networks. In this study, we propose a multi-channel multi-frame (MCMF) all deep learning (ADL)-MVDR approach for target speech separation, which extends our preliminary multi-channel ADL-MVDR approach. The proposed MCMF ADL-MVDR system addresses both linear and nonlinear distortions, and it fully exploits spatio-temporal cross-correlations. The proposed systems are evaluated on a Mandarin audio-visual corpus and compared with several state-of-the-art approaches. Experimental results demonstrate the superiority of the proposed systems under different scenarios and across several objective evaluation metrics, including ASR performance.
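For readers unfamiliar with the baseline the abstract refers to, the following is a minimal NumPy sketch of a conventional mask-based MVDR beamformer, not the proposed ADL-MVDR system. It assumes the widely used reference-channel formulation, w(f) = Φ_NN(f)⁻¹ Φ_SS(f) u / tr(Φ_NN(f)⁻¹ Φ_SS(f)), and it shows where the matrix inverse enters. The function names, array layouts, and the diagonal-loading term are illustrative assumptions; diagonal loading is a common workaround for an ill-conditioned noise covariance and is not taken from the abstract.

```python
import numpy as np


def masked_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.

    Y:    complex STFT, shape (C, T, F) = (channels, frames, bins) [assumed layout]
    mask: real-valued time-frequency mask in [0, 1], shape (T, F)
    Returns an array of shape (F, C, C).
    """
    weighted = Y * mask[None, :, :]
    # Sum over time of mask(t, f) * y(t, f) y(t, f)^H for each bin f.
    phi = np.einsum("ctf,dtf->fcd", weighted, Y.conj())
    return phi / (mask.sum(axis=0)[:, None, None] + eps)


def mvdr_weights(phi_ss, phi_nn, ref=0, loading=1e-6):
    """Reference-channel MVDR weights, shape (F, C).

    phi_ss, phi_nn: speech and noise covariances, shape (F, C, C)
    ref:            index of the reference microphone
    loading:        relative diagonal loading (an assumption of this sketch,
                    added to keep the linear solve well conditioned)
    """
    num_channels = phi_nn.shape[-1]
    scale = np.trace(phi_nn, axis1=-2, axis2=-1).real / num_channels
    phi_nn = phi_nn + loading * scale[:, None, None] * np.eye(num_channels)
    # Solve Phi_NN X = Phi_SS instead of forming the explicit inverse.
    numerator = np.linalg.solve(phi_nn, phi_ss)
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    return numerator[..., ref] / (trace[:, None] + 1e-12)


def apply_beamformer(w, Y):
    """Filter-and-sum output w(f)^H y(t, f), shape (T, F)."""
    return np.einsum("fc,ctf->tf", w.conj(), Y)
```

In this sketch the speech and noise masks would come from a neural network, and the solve step is the operation that the abstract describes as sometimes numerically unstable under joint training.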