End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the model distribution differs from the underlying data distribution. In this paper, a residual energy-based model (R-EBM) is proposed to complement the auto-regressive ASR model and close the gap between the two distributions. Meanwhile, R-EBMs can also be regarded as utterance-level confidence estimators, which may benefit many downstream tasks. Experiments on a 100hr LibriSpeech dataset show that R-EBMs can reduce the word error rates (WERs) by 8.2%/6.7% while improving areas under precision-recall curves of confidence scores by 12.6%/28.4% on the test-clean/test-other sets. Furthermore, on a state-of-the-art model using self-supervised learning (wav2vec 2.0), R-EBMs still significantly improve both the WER and confidence estimation performance.
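To make the two distributions concrete, the following is a minimal sketch assuming the R-EBM follows the standard residual energy-based formulation; the symbols $\theta$, $\phi$, $E_\phi$, and $Z_{\theta,\phi}$ are illustrative notation, not taken from the abstract. The locally normalised auto-regressive model factorises the sequence probability token by token, while the R-EBM re-weights whole hypotheses with an utterance-level energy:

\[
P_\theta(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{|\mathbf{y}|} P_\theta\left(y_i \mid y_{<i}, \mathbf{x}\right),
\qquad
\hat{P}_{\theta,\phi}(\mathbf{y} \mid \mathbf{x}) = \frac{P_\theta(\mathbf{y} \mid \mathbf{x})\, e^{-E_\phi(\mathbf{x}, \mathbf{y})}}{Z_{\theta,\phi}(\mathbf{x})},
\qquad
Z_{\theta,\phi}(\mathbf{x}) = \sum_{\mathbf{y}'} P_\theta(\mathbf{y}' \mid \mathbf{x})\, e^{-E_\phi(\mathbf{x}, \mathbf{y}')}.
\]

Under this view, $-E_\phi(\mathbf{x}, \mathbf{y})$ scores the plausibility of a complete hypothesis given the acoustics, which is why the same residual network can double as an utterance-level confidence estimator.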