Self-supervised learning representation (SSLR) has proven highly effective for automatic speech recognition (ASR), mainly on clean speech. Recent work has shown the benefit of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper extends that integration to multi-channel input. We propose a novel end-to-end architecture that integrates dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track, with a word error rate (WER) of 1.77%. While the strong WavLM-based SSLR yields promising results on its own, end-to-end integration with the weighted power minimization distortionless response beamformer, which performs dereverberation and denoising simultaneously, improves WER significantly. The effectiveness of the approach is also validated on the REVERB dataset.
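For readers unfamiliar with the metric quoted above, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring script used for the reported CHiME-4 or REVERB results; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: assumes whitespace tokenization and no text
    normalization, which real scoring tools typically apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a WER of 1.77% corresponds to fewer than two word errors per hundred reference words on average.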