Many purely neural-network-based speech separation approaches have been proposed that greatly improve objective assessment scores, but they often introduce nonlinear distortions that are harmful to automatic speech recognition (ASR). Minimum variance distortionless response (MVDR) filters strive to remove nonlinear distortions; however, these approaches are either suboptimal for removing residual (linear) noise or unstable when used jointly with neural networks. In this study, we propose a multi-channel multi-frame (MCMF) all deep learning (ADL)-MVDR approach for target speech separation, which extends our preliminary multi-channel ADL-MVDR approach. The MCMF ADL-MVDR handles different numbers of microphone channels within a single framework and addresses both linear and nonlinear distortions. Spatio-temporal cross correlations are also fully utilized in the proposed approach. The proposed system is evaluated on a Mandarin audio-visual corpus and compared with several state-of-the-art approaches. Experimental results demonstrate the superiority of the proposed framework under different scenarios and across several objective evaluation metrics, including ASR performance.
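For readers unfamiliar with the "distortionless" constraint mentioned above, the following is a minimal sketch of the classical (textbook) MVDR solution at a single frequency bin, w = Φ⁻¹v / (vᴴΦ⁻¹v). It is not the ADL-MVDR network proposed in the paper. The function names, the 2-microphone noise covariance, and the steering vector are all hypothetical values chosen for illustration.

```python
# Classical single-bin MVDR weights using only the standard library.
# Illustrative sketch of the textbook closed form, NOT the paper's ADL-MVDR.


def solve(a, b):
    """Solve a @ x = b for a complex square matrix via Gaussian elimination."""
    n = len(b)
    # Build an augmented matrix copy so the inputs are not modified.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest magnitude entry.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0j] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mvdr_weights(noise_cov, steering):
    """Return w = Phi^{-1} v / (v^H Phi^{-1} v)."""
    u = solve(noise_cov, steering)  # u = Phi^{-1} v
    denom = sum(v.conjugate() * ui for v, ui in zip(steering, u))
    return [ui / denom for ui in u]


def apply_filter(weights, x):
    """Return the beamformer output y = w^H x for one time-frequency bin."""
    return sum(w.conjugate() * xi for w, xi in zip(weights, x))


# Hypothetical 2-microphone example (Hermitian, positive-definite covariance).
noise_cov = [[2.0 + 0j, 0.5 + 0.5j],
             [0.5 - 0.5j, 1.0 + 0j]]
steering = [1 + 0j, 0 + 1j]

w = mvdr_weights(noise_cov, steering)
gain = apply_filter(w, steering)  # response toward the target direction
```

In this sketch the response toward the assumed target direction, `gain`, equals 1 up to rounding error, which is the distortionless constraint: the filter minimizes noise power while passing the target component unchanged.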