Automatic Speech Recognition (ASR) systems are known to exhibit difficulties when transcribing children's speech. This can mainly be attributed to the absence of large children's speech corpora to train robust ASR models and the resulting domain mismatch when decoding children's speech with systems trained on adult data. In this paper, we propose multiple enhancements to alleviate these issues. First, we propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech. This enables us to leverage the abundance of adult speech corpora by making these samples perceptually similar to children's speech. Second, using this augmentation strategy, we apply transfer learning to a Transformer model pre-trained on adult data. This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora to learn general and robust acoustic frame-level representations. Fine-tuning this model for the ASR task on adult data augmented with the proposed source-filter warping strategy, together with a limited amount of in-domain children's speech, significantly outperforms previous state-of-the-art results on the PF-STAR British English Children's Speech corpus, reaching a 4.86% WER on the official test set.
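The abstract does not spell out the warping procedure itself, so the following is only a minimal sketch of a source-filter style augmentation under assumed parameters: it decomposes an adult utterance with the WORLD vocoder (via the `pyworld` package) into pitch (source) and spectral envelope (filter), raises F0 and stretches the envelope along the frequency axis to mimic a shorter vocal tract, and resynthesizes. The function name, scale factors, and file names are illustrative and not the authors' implementation.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def source_filter_warp(x, fs, f0_scale=2.0, formant_scale=1.15):
    """Warp an adult utterance toward child-like speech (illustrative sketch):
    raise the fundamental frequency (source) and shift the spectral
    envelope's formants upward (filter)."""
    x = np.ascontiguousarray(x, dtype=np.float64)

    # WORLD analysis: pitch contour, spectral envelope, aperiodicity
    f0, t = pw.harvest(x, fs)
    sp = pw.cheaptrick(x, f0, t, fs)
    ap = pw.d4c(x, f0, t, fs)

    # Source warping: scale F0 upward (children have higher pitch)
    f0_warped = f0 * f0_scale

    # Filter warping: resample each frame of the envelope along the
    # frequency axis so spectral content (formants) moves up by ~formant_scale
    n_bins = sp.shape[1]
    orig_axis = np.arange(n_bins)
    warped_axis = orig_axis / formant_scale
    sp_warped = np.empty_like(sp)
    ap_warped = np.empty_like(ap)
    for i in range(sp.shape[0]):
        sp_warped[i] = np.interp(warped_axis, orig_axis, sp[i])
        ap_warped[i] = np.interp(warped_axis, orig_axis, ap[i])

    # Resynthesize the perceptually child-like utterance
    return pw.synthesize(f0_warped, sp_warped, ap_warped, fs)

# Example usage (16 kHz mono input assumed; file names are placeholders)
x, fs = sf.read("adult_utterance.wav")
y = source_filter_warp(x, fs)
sf.write("child_like_utterance.wav", y, fs)
```

In this sketch the source and filter are warped independently, which is the property the abstract relies on: pitch and formant positions can be shifted by different factors, unlike simple speed perturbation which scales both together.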