The Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) to support fusion with an external language model (LM). In HAT, the blank probability and the label probabilities are estimated with two separate probability distributions, which yields a more accurate estimate of the internal LM score and therefore works better when the model is combined with an external LM. Previous work has mainly trained HAT models with the negative log-likelihood loss. In this paper, we study minimum word error rate (MWER) training of HAT, a criterion that is closer to the evaluation metric for speech recognition and that has been successfully applied to other types of end-to-end models, such as sequence-to-sequence (S2S) and RNN-T models. In experiments with around 30,000 hours of training data, we show that MWER training improves the accuracy of HAT models while also making them more robust to decoding hyper-parameters such as length normalization and the decoding beam during inference.
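To make the separation of blank and label probabilities concrete, the following is a minimal sketch of one common way to describe the HAT local posterior at a single lattice node: a sigmoid (Bernoulli) for blank, and a softmax over non-blank labels scaled by the remaining probability mass. The abstract does not spell out this parameterization, so it is an assumption here, and the function and argument names are hypothetical.

```python
import math


def hat_posterior(blank_logit, label_logits):
    """Return (p_blank, label_probs) for one lattice node.

    blank_logit  -- scalar score for emitting blank (hypothetical input)
    label_logits -- list of scores for the non-blank labels (hypothetical input)
    """
    # Blank is modeled on its own as a Bernoulli variable via a sigmoid.
    p_blank = 1.0 / (1.0 + math.exp(-blank_logit))

    # Labels get a softmax over non-blank symbols only
    # (max-subtraction keeps the exponentials numerically stable).
    m = max(label_logits)
    exps = [math.exp(x - m) for x in label_logits]
    z = sum(exps)

    # The label distribution is scaled by the probability of not emitting blank,
    # so blank and labels together still sum to one.
    label_probs = [(1.0 - p_blank) * e / z for e in exps]
    return p_blank, label_probs


if __name__ == "__main__":
    p_blank, label_probs = hat_posterior(0.0, [0.0, 0.0])
    print(p_blank, label_probs)
```

Because the label softmax excludes blank in this sketch, its scores can be read as a label-only distribution, which is the property that makes an internal LM score easier to estimate than in a single joint softmax.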
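The MWER criterion named in the abstract can be illustrated with a small sketch of the N-best approximation often used in the literature: hypothesis probabilities are renormalized over the N-best list, and the loss is the expected number of word errors relative to the list average. Whether the paper uses exactly this form is an assumption, and the function name and inputs are hypothetical.

```python
import math


def mwer_loss(log_probs, word_errors):
    """Expected relative word errors over an N-best list.

    log_probs   -- model log-probabilities of the N-best hypotheses
    word_errors -- word error counts of the same hypotheses against the reference
    """
    # Renormalize the hypothesis probabilities over the N-best list
    # (softmax in log space, with max-subtraction for stability).
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    z = sum(weights)
    posteriors = [w / z for w in weights]

    # Subtracting the mean error count acts as a baseline; it does not change
    # the gradient with respect to the posteriors but reduces its variance.
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, word_errors))


if __name__ == "__main__":
    # Two hypotheses: the likelier one has no errors, the other has four.
    print(mwer_loss([math.log(0.75), math.log(0.25)], [0, 4]))
```

Minimizing this quantity shifts probability mass toward hypotheses with fewer word errors than the list average, which is why the criterion tracks the evaluation metric more closely than negative log-likelihood does.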