Thanks to the rise of self-supervised learning, automatic speech recognition (ASR) systems now achieve near-human performance on a wide variety of datasets. However, they still lack generalization capability and are not robust to domain shifts such as accent variations. In this work, we use speech audio representing four different French accents to create fine-tuning datasets that improve the robustness of pre-trained ASR models. By incorporating various accents into the training set, we obtain both in-domain and out-of-domain improvements. Our numerical experiments show that we can reduce error rates by up to 25% (relative) on African and Belgian accents compared to single-domain training, while maintaining good performance on standard French.