Recently, fully recurrent neural network (RNN) based end-to-end models have proven effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks, focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence, which reduces computation. Besides these architectural improvements, we also incorporate an external dereverberation preprocessing method, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reductions, down to 12.1% and 6.4% WER, under the anechoic condition on the single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reductions, down to 16.5% and 15.2% WER.
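To make the segment-restricted self-attention concrete, the following is a minimal PyTorch sketch of one plausible realization, assuming non-overlapping fixed-size segments; the segment size and the blocking scheme are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def segment_attention_mask(seq_len: int, segment_size: int) -> torch.Tensor:
    # True where query frame i and key frame j fall in the same
    # fixed-size segment, so attention never spans the whole sequence.
    seg = torch.arange(seq_len) // segment_size
    return seg.unsqueeze(1) == seg.unsqueeze(0)  # (seq_len, seq_len)

# Toy usage: batch=1, heads=4, frames=256, head dim=64 (all illustrative).
q = k = v = torch.randn(1, 4, 256, 64)
mask = segment_attention_mask(seq_len=256, segment_size=32)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

With segments of length S, each query attends to at most S keys, so the attention cost drops from O(T^2) to O(T*S) per layer, which is the motivation for restricting the masking network's self-attention in this way.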
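The WPE dereverberation is applied as an external preprocessing stage on the multi-channel signal before the rest of the pipeline. A sketch using the open-source nara_wpe package is shown below; the STFT parameters and the taps/delay/iterations settings are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

# y: multi-channel time-domain mixture, shape (channels, samples).
y = np.random.randn(6, 16000 * 4)

Y = stft(y, size=512, shift=128)             # (channels, frames, bins)
Y = Y.transpose(2, 0, 1)                     # (bins, channels, frames), as wpe expects
Z = wpe(Y, taps=10, delay=3, iterations=3)   # dereverberated STFT, same shape
z = istft(Z.transpose(1, 2, 0), size=512, shift=128)  # back to (channels, samples)
```

The dereverberated signal z then feeds the masking network and neural beamformer, which is how an off-the-shelf WPE front end can be combined with the Transformer-based models described above.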