This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German. Approximately 5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children. For training on the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances. To exploit the untranscribed data, various approaches are implemented and combined to incrementally improve system performance. First, bidirectional autoregressive predictive coding (Bi-APC) is used to learn initial parameters for acoustic modelling from the untranscribed data. Second, incremental semi-supervised learning is used to iteratively generate pseudo-transcribed data. Third, different data augmentation schemes are applied at different training stages to increase the variability and size of the training data. Finally, a recurrent neural network language model (RNNLM) is used for rescoring. Our system achieves a word error rate (WER) of 39.68% on the evaluation data, an approximately 12% relative improvement over the official baseline (45.21%).
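The incremental semi-supervised step can be sketched as a confidence-filtered pseudo-labeling loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `train_fn`, `decode_fn`, and the fixed confidence threshold are hypothetical placeholders for the actual acoustic-model training and decoding pipeline.

```python
# Sketch of incremental semi-supervised learning (assumptions, not the
# paper's code): each round, the current model decodes the remaining
# untranscribed utterances, confident hypotheses join the training pool
# as pseudo-transcripts, and the model is retrained on the larger pool.

def incremental_semi_supervised(train_pool, unlabeled, train_fn, decode_fn,
                                rounds=3, conf_threshold=0.9):
    """train_fn(pool) -> model; decode_fn(model, utt) -> (hypothesis, confidence)."""
    model = train_fn(train_pool)
    remaining = list(unlabeled)
    for _ in range(rounds):
        kept, low_conf = [], []
        for utt in remaining:
            hyp, conf = decode_fn(model, utt)
            # Only hypotheses above the threshold become pseudo-transcripts.
            (kept if conf >= conf_threshold else low_conf).append((utt, hyp))
        train_pool = train_pool + kept
        remaining = [utt for utt, _ in low_conf]
        model = train_fn(train_pool)  # retrain on the enlarged pool
    return model, train_pool
```

In practice each round would also re-tune the threshold or use lattice-based confidence, but the loop structure above is the core of the iterative pseudo-transcription described in the abstract.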