Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to score candidate phrases from the acoustic information extracted by an ASR encoder; these scores can be used both to filter out unlikely phrases and to compute bonuses for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors. Experiments on the Librispeech biasing benchmark show that our method effectively filters out the majority of candidate phrases and, when the scores are used in shallow-fusion biasing, significantly improves recognition accuracy under a range of biasing conditions. Our approach is modular and can be combined with any ASR system, and the filtering mechanism can potentially boost the performance of other biasing methods.