Second-pass rescoring is an important component of automatic speech recognition (ASR) systems, used to improve the output of a first-pass decoder through lattice rescoring or $n$-best re-ranking. While pretraining with a masked language model (MLM) objective has achieved great success on various natural language understanding (NLU) tasks, it has not gained traction as a rescoring model for ASR. In particular, training a bidirectional model such as BERT with a discriminative objective such as minimum WER (MWER) has not been explored. Here we show how to train a BERT-based rescoring model with MWER loss, incorporating the benefits of a discriminative objective into the fine-tuning of deep bidirectional pretrained models for ASR. Specifically, we propose a fusion strategy that incorporates the MLM into the discriminative training process to effectively distill knowledge from a pretrained model, and we further propose an alternative discriminative loss. This approach, which we call RescoreBERT, reduces WER by 6.6%/3.4% relative on the LibriSpeech clean/other test sets over a BERT baseline without a discriminative objective. We also evaluate our method on an internal dataset from a conversational agent and find that it reduces both latency and WER (by 3 to 8% relative) compared to an LSTM rescoring model.
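To make the MWER objective for $n$-best rescoring concrete, the sketch below shows one common formulation: the second-pass (BERT-based) score is interpolated with the first-pass score, a softmax over the $n$-best list yields hypothesis posteriors, and the loss is the expected word-error count relative to the list average. This is a minimal PyTorch illustration under assumed conventions (the tensor names, the `beta` interpolation weight, and the score combination are hypothetical), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def mwer_loss(first_pass_scores, second_pass_scores, word_errors, beta=1.0):
    """MWER-style loss over an n-best list (illustrative sketch).

    first_pass_scores:  (batch, n_best) log-scores from the first-pass decoder
    second_pass_scores: (batch, n_best) scores from the BERT-based rescorer
    word_errors:        (batch, n_best) word-error counts per hypothesis
    beta:               interpolation weight for the second-pass score (assumed)
    """
    # Combine first- and second-pass scores (linear interpolation is assumed).
    combined = first_pass_scores + beta * second_pass_scores
    # Normalize over the n-best list to obtain hypothesis posteriors.
    posteriors = F.softmax(combined, dim=-1)
    # Subtract the list-average error so the loss is relative to the mean.
    relative_errors = word_errors - word_errors.mean(dim=-1, keepdim=True)
    # Expected (relative) number of word errors under the posterior.
    return (posteriors * relative_errors).sum(dim=-1).mean()
```

Minimizing this expectation pushes probability mass toward hypotheses with fewer word errors, which is the discriminative behavior the fine-tuning aims to instill in the pretrained bidirectional model.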