Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations because entity diversity is limited. Moreover, prior methods treat entities as independent tokens, which leads to incomplete biasing of multi-token entities. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. Together, these components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that, with 1,000 distractors, PARCO achieves a character error rate (CER) of 4.22% on the Chinese AISHELL-1 dataset and a word error rate (WER) of 11.14% on the English DATA2 dataset, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets such as THCHS-30 and LibriSpeech.
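For readers unfamiliar with the reported metrics, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over characters for CER and over whitespace-separated words for WER. The function names and the example sentences are illustrative assumptions, not taken from the PARCO evaluation code, and text normalization used in the paper's experiments is not reproduced here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(ref: str, hyp: str) -> float:
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


if __name__ == "__main__":
    # Hypothetical homophone confusion on a named entity: "wright" vs. "right".
    print(wer("call doctor wright now", "call doctor right now"))
```

In this made-up example a single homophone substitution in a four-word reference yields a WER of 0.25, which illustrates why entity-level errors can dominate the error rate on entity-rich utterances.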