Contextual ASR, which takes a list of bias terms as input along with audio, has drawn recent interest as ASR use becomes more widespread. We are releasing contextual biasing lists to accompany the Earnings21 dataset, creating a public benchmark for this task. We present baseline results on this benchmark using a pretrained end-to-end ASR model from the WeNet toolkit. We show results for shallow fusion contextual biasing applied to two different decoding algorithms. Our baseline results confirm observations that end-to-end models struggle in particular with words that are rarely or never seen during training, and that existing shallow fusion techniques do not adequately address this problem. We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative and of out-of-vocabulary words by 97.2% relative, compared to contextual biasing without alternate spellings. This model is conceptually similar to ones used in prior work, but is simpler to implement as it does not rely on either a pronunciation dictionary or an existing text-to-speech system.
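As background for the shallow fusion approach referenced above, the sketch below shows the core idea in simplified form: each beam-search hypothesis score is augmented with an additive log-domain bonus for matched bias terms. In practice the boost is applied incrementally during decoding (often via a prefix trie or WFST over the bias list); this hypothesis-level rescoring form, and all names in it (`BIAS_TERMS`, `BIAS_WEIGHT`, `rescore_hypotheses`), are illustrative assumptions, not the paper's released code or the WeNet API.

```python
# Minimal sketch of shallow-fusion contextual biasing, applied as
# hypothesis-level rescoring for clarity. All identifiers here are
# hypothetical; real systems apply the boost token-by-token during
# beam search, typically with a prefix trie or WFST over the bias list.

BIAS_TERMS = {"earnings", "ebitda", "acme corp"}  # hypothetical bias list
BIAS_WEIGHT = 2.0  # lambda: additive log-score bonus per matched term


def bias_score(text: str) -> float:
    """Shallow-fusion bonus: lambda for each bias term found in the text."""
    lowered = text.lower()
    return BIAS_WEIGHT * sum(term in lowered for term in BIAS_TERMS)


def rescore_hypotheses(hypotheses: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Re-rank (text, asr_log_prob) hypotheses by the fused score:

        score(y) = log P_ASR(y | x) + lambda * n_bias_terms(y)
    """
    return sorted(hypotheses, key=lambda h: h[1] + bias_score(h[0]), reverse=True)


if __name__ == "__main__":
    beams = [
        ("the acne core earnings call", -3.9),  # higher ASR score, wrong entity
        ("the acme corp earnings call", -4.2),  # bias terms push this to the top
    ]
    print(rescore_hypotheses(beams)[0][0])  # -> "the acme corp earnings call"
```

The key design point is that the bias list only perturbs relative scores among hypotheses the decoder already proposes, which is why, as the abstract notes, shallow fusion alone cannot recover out-of-vocabulary spellings that never enter the beam.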