Recently, there has been increasing interest in two-pass streaming end-to-end speech recognition (ASR), which incorporates a 2nd-pass rescoring model on top of a conventional 1st-pass streaming ASR model to improve recognition accuracy while keeping latency low. One of the latest 2nd-pass rescoring models, the Transformer Rescorer, takes the n-best initial outputs and audio embeddings from the 1st-pass model and then chooses the best output by re-scoring the n-best initial outputs. However, training this Transformer Rescorer requires expensive paired audio-text training data because the model uses audio embeddings as input. In this work, we present our Joint Audio/Text training method for the Transformer Rescorer, which leverages unpaired text-only data that is much cheaper than paired audio-text data. We evaluate the Transformer Rescorer with our Joint Audio/Text training on the LibriSpeech dataset as well as our large-scale in-house dataset, and show that our training method can improve word error rate (WER) significantly compared to the standard Transformer Rescorer, without requiring any extra model parameters or added latency.
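For intuition, the following is a minimal sketch of what 2nd-pass n-best rescoring looks like, assuming a toy PyTorch rescorer (ToyTransformerRescorer), random stand-in audio embeddings, and fixed-length hypotheses; all names, shapes, and hyperparameters here are illustrative and are not the paper's actual model or training code.

# Toy sketch of 2nd-pass n-best rescoring: score each 1st-pass hypothesis
# conditioned on the 1st-pass audio embeddings, then keep the best-scoring one.
import torch
import torch.nn as nn

class ToyTransformerRescorer(nn.Module):
    def __init__(self, vocab_size=100, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def hypothesis_log_prob(self, tokens, audio_emb):
        # tokens: (1, T) token ids; audio_emb: (1, S, d_model) from the 1st pass.
        # Causal mask so each position only attends to previous tokens.
        emb = self.embed(tokens[:, :-1])
        mask = nn.Transformer.generate_square_subsequent_mask(emb.size(1))
        hidden = self.decoder(emb, audio_emb, tgt_mask=mask)
        log_probs = self.out(hidden).log_softmax(dim=-1)
        # Sum the log-prob of each actual next token: the sequence log-likelihood.
        target = tokens[:, 1:]
        return log_probs.gather(-1, target.unsqueeze(-1)).sum()

rescorer = ToyTransformerRescorer()
audio_emb = torch.randn(1, 50, 64)                          # stand-in audio embeddings
nbest = [torch.randint(0, 100, (1, 12)) for _ in range(4)]  # 4 toy n-best hypotheses
scores = [rescorer.hypothesis_log_prob(h, audio_emb) for h in nbest]
best = nbest[max(range(len(nbest)), key=lambda i: scores[i])]

Because the rescorer conditions on audio embeddings, fully training it with paired data is expensive; the joint training idea is to let the text-side of such a model also learn from unpaired text, which a sketch like this could mimic by scoring text with the audio memory absent or masked.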