Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), enabling adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), by introducing an acoustic reranking step, yielding TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.
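To make the two-stage selection concrete, the following is a minimal sketch of the retrieve-then-rerank idea described above: text-embedding KNN retrieval (TICL) followed by acoustic reranking (TICL+). The function and variable names, embedding dimensions, candidate counts, and the use of cosine similarity are illustrative assumptions, not the paper's exact implementation; real text and speech encoders would replace the random embeddings.

```python
import numpy as np

def cosine_sim(query, matrix):
    # Cosine similarity between a query vector and each row of a matrix.
    q = query / (np.linalg.norm(query) + 1e-9)
    m = matrix / (np.linalg.norm(matrix, axis=-1, keepdims=True) + 1e-9)
    return m @ q

def select_examples_ticl_plus(query_text_emb, query_audio_emb,
                              pool_text_embs, pool_audio_embs,
                              n_retrieve=16, n_context=4):
    """Hypothetical two-stage selector: text KNN retrieval, acoustic rerank."""
    # Stage 1 (TICL): retrieve the n_retrieve nearest pool examples by
    # similarity between text embeddings.
    text_scores = cosine_sim(query_text_emb, pool_text_embs)
    candidates = np.argsort(-text_scores)[:n_retrieve]

    # Stage 2 (TICL+): rerank those candidates by acoustic similarity so the
    # final in-context examples are also acoustically close to the test input.
    acoustic_scores = cosine_sim(query_audio_emb, pool_audio_embs[candidates])
    reranked = candidates[np.argsort(-acoustic_scores)]
    return reranked[:n_context]

# Toy usage with random vectors standing in for real text/speech encoders.
rng = np.random.default_rng(0)
pool_text = rng.normal(size=(1000, 384))    # assumed text-embedding dim
pool_audio = rng.normal(size=(1000, 512))   # assumed speech-encoder dim
query_text = rng.normal(size=384)
query_audio = rng.normal(size=512)
print(select_examples_ticl_plus(query_text, query_audio, pool_text, pool_audio))
```

Under these assumptions, the acoustic rerank only reorders the semantically retrieved candidates, so TICL+ reduces to TICL when the acoustic scores are uninformative.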