Multi-channel inputs offer several advantages over single-channel inputs for improving the robustness of on-device speech recognition systems. Recent work on the multi-channel transformer has proposed a way to incorporate such inputs into end-to-end ASR for improved accuracy. However, this approach has high computational complexity, which prevents it from being deployed in on-device systems. In this paper, we present a novel speech recognition model, the Multi-Channel Transformer Transducer (MCTT), which features end-to-end multi-channel training, low computational cost, and low latency, making it suitable for streaming decoding in on-device speech recognition. On a far-field in-house dataset, MCTT outperforms stagewise multi-channel models with a transformer-transducer by up to 6.01% relative WER improvement (WERR). In addition, MCTT outperforms the multi-channel transformer by up to 11.62% WERR and is 15.8 times faster in terms of inference speed. We further show that the computational cost of MCTT can be reduced by constraining the future and previous context in the attention computations.
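The context constraint mentioned above can be illustrated with a minimal sketch of scaled dot-product self-attention in which each query frame may only attend to a bounded window of previous and future frames. This is a hypothetical NumPy illustration of the general masking idea, not the paper's actual implementation; the function name and parameters (`left_ctx`, `right_ctx`) are assumptions for exposition.

```python
import numpy as np

def limited_context_attention(q, k, v, left_ctx, right_ctx):
    """Scaled dot-product self-attention where query frame i may only
    attend to key frames in [i - left_ctx, i + right_ctx].
    Sketch only; the paper's exact masking scheme may differ."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)        # (T, T) similarity matrix
    idx = np.arange(T)
    dist = idx[None, :] - idx[:, None]   # key index minus query index
    # disallow positions outside the context window
    mask = (dist < -left_ctx) | (dist > right_ctx)
    scores[mask] = -np.inf
    # row-wise softmax; masked positions contribute zero weight
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

# Tiny usage example: 6 frames of 4-dim features, causal attention
# (no future context), as needed for streaming decoding.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
out = limited_context_attention(x, x, x, left_ctx=2, right_ctx=0)
print(out.shape)  # (6, 4)
```

Setting `right_ctx=0` yields a causal mask suitable for streaming, while a small `left_ctx` bounds the per-frame computation, which is the trade-off the abstract refers to.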