Contextual biasing is an important and challenging task for end-to-end automatic speech recognition (ASR) systems. It aims to improve recognition performance by biasing the ASR system toward particular context phrases such as person names, music lists, and proper nouns. Existing methods mainly include contextual language model (LM) biasing and adding a bias encoder to end-to-end ASR models. In this work, we introduce a novel approach to contextual biasing that adds a contextual spelling correction model on top of the end-to-end ASR system. We incorporate contextual information into a sequence-to-sequence spelling correction model with a shared context encoder. Our proposed model comes in two variants: autoregressive (AR) and non-autoregressive (NAR). We propose filtering algorithms to handle large context lists, and performance balancing mechanisms to control the degree of biasing. We demonstrate that the proposed model is a general biasing solution that is domain-insensitive and can be adopted in different scenarios. Experiments show that the proposed method achieves as much as 51% relative word error rate (WER) reduction over the ASR system and outperforms traditional biasing methods. Compared to the AR solution, the proposed NAR model reduces model size by 43.2% and speeds up inference by 2.1 times.
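To make the headline metric concrete, the following is a minimal sketch of how WER and a relative WER reduction are conventionally computed. It uses the standard word-level edit distance definition and is not the evaluation code used in this work. The function names and the example utterances are hypothetical and chosen only for illustration.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline_wer, new_wer):
    """Relative reduction = (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical utterance containing a person name (illustrative only).
reference = "call anna kowalski on her mobile"
asr_output = "call ana kovalsky on her mobile"   # two name errors
corrected = "call ana kowalski on her mobile"    # one name error remains

baseline = wer(reference, asr_output)
improved = wer(reference, corrected)
print(f"baseline WER: {baseline:.3f}, corrected WER: {improved:.3f}")
print(f"relative WER reduction: {relative_reduction(baseline, improved):.1%}")
```

Under this definition, a 51% relative reduction means the WER after correction is roughly half of the baseline WER, regardless of the absolute baseline value.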