We introduce the Globally Normalized Autoregressive Transducer (GNAT) to address the label bias problem in streaming speech recognition. Our solution admits a tractable, exact computation of the denominator for sequence-level normalization. Through theoretical and empirical results, we demonstrate that switching to a globally normalized model greatly reduces the word error rate gap between streaming and non-streaming speech recognition models (by more than 50% on the LibriSpeech dataset). The model is developed in a modular framework that encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and the creation of new models.
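To make the idea of an exactly computable sequence-level denominator concrete, the following is a minimal sketch and not the GNAT implementation. It assumes a toy model in which the unnormalized log-score of each label depends only on the time step and the previous label (a hypothetical first-order context chosen for illustration), and all function names are hypothetical. Under that assumption, a forward-style dynamic program yields the same log-denominator as brute-force enumeration over every label sequence, at a cost linear in the sequence length.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def make_toy_scores(num_steps, num_labels):
    """Hypothetical unnormalized log-scores, indexed as scores[t][prev][cur].

    At t = 0 only prev = 0 is used, standing in for a start state.
    """
    return [
        [
            [math.sin(7 * t + 3 * prev + cur) for cur in range(num_labels)]
            for prev in range(num_labels)
        ]
        for t in range(num_steps)
    ]


def path_score(scores, seq):
    """Unnormalized log-score of one complete label sequence."""
    total = scores[0][0][seq[0]]
    for t in range(1, len(seq)):
        total += scores[t][seq[t - 1]][seq[t]]
    return total


def log_denominator_dp(scores, num_labels):
    """Exact log-denominator via a forward recursion over the previous label."""
    alpha = [scores[0][0][cur] for cur in range(num_labels)]
    for t in range(1, len(scores)):
        alpha = [
            logsumexp([alpha[prev] + scores[t][prev][cur]
                       for prev in range(num_labels)])
            for cur in range(num_labels)
        ]
    return logsumexp(alpha)


def log_denominator_brute(scores, num_labels):
    """Exact log-denominator by enumerating every label sequence."""
    sequences = itertools.product(range(num_labels), repeat=len(scores))
    return logsumexp([path_score(scores, seq) for seq in sequences])


def sequence_log_prob(scores, seq, log_z):
    """Globally normalized log-probability: path score minus log-denominator."""
    return path_score(scores, seq) - log_z


if __name__ == "__main__":
    scores = make_toy_scores(num_steps=4, num_labels=3)
    log_z = log_denominator_dp(scores, 3)
    print("DP log-denominator:   ", log_z)
    print("Brute log-denominator:", log_denominator_brute(scores, 3))
    print("log P of (0, 1, 2, 0):", sequence_log_prob(scores, (0, 1, 2, 0), log_z))
```

In this sketch the probability of a sequence is its total score divided by a single denominator shared by all sequences, rather than a product of per-step softmax outputs, which is the distinction between global and local normalization.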