Despite recent advancements in deep learning technologies, child speech recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated training data, which are scarce for child speech. In this work, we explore using the wav2vec2 ASR model with different pretraining and finetuning configurations for self-supervised learning (SSL) to improve automatic child speech recognition. The pretrained wav2vec2 models were finetuned on different amounts of child speech training data to determine the optimal amount of data required to finetune the model for child ASR. Our trained model achieves a best word error rate (WER) of 8.37 on the in-domain MyST dataset and a WER of 10.38 on the out-of-domain PFSTAR dataset. We do not use any Language Models (LM) in our experiments.
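To make the setup concrete, the following is a minimal sketch of finetuning a pretrained wav2vec2 checkpoint with CTC on child speech, assuming the HuggingFace `transformers` implementation. The checkpoint name, the audio file path, and the transcript are placeholders for illustration, not the paper's exact configuration; decoding is greedy, consistent with the LM-free evaluation described above.

```python
# Hedged sketch: finetune a pretrained wav2vec2 model with CTC on a child speech
# utterance, then decode greedily (no language model). Checkpoint name and file
# path below are illustrative assumptions, not the paper's actual settings.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-base-960h"  # placeholder pretrained checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
model.freeze_feature_encoder()  # common practice: keep the CNN feature encoder frozen
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One training step on a single (audio, transcript) pair of child speech.
waveform, sr = torchaudio.load("child_utterance.wav")           # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # wav2vec2 expects 16 kHz audio
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("THE CAT SAT ON THE MAT", return_tensors="pt").input_ids

model.train()
loss = model(inputs.input_values, labels=labels).loss  # CTC loss against the transcript
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Greedy decoding for WER evaluation, matching the no-LM setup.
model.eval()
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```

In practice, the loop above would iterate over varying-sized subsets of the child speech training data (as described for MyST), with WER computed on held-out in-domain and out-of-domain test sets.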