This paper describes the NTNU ASR system that participated in the Interspeech 2020 Non-Native Children's Speech ASR Challenge, supported by the SIG-CHILD group of ISCA. The shared task is especially challenging because it combines the variability of non-native speech with that of children's speech. In the closed-track evaluation, participants were restricted to developing their systems solely on the speech and text corpora provided by the organizer. To work around this under-resourced setting, we built our ASR system on CNN-TDNNF-based acoustic models and combined several data augmentation strategies, including utterance- and word-level speed perturbation and spectrogram augmentation, with a simple yet effective data-cleansing approach. All variants of our ASR system used an RNN-based language model, trained solely on the text dataset released by the organizer, to rescore the first-pass recognition hypotheses. Our best configuration took second place with a word error rate (WER) of 17.59%, while the top-performing, second runner-up, and official baseline systems achieved 15.67%, 18.71%, and 35.09%, respectively.
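To make the idea of spectrogram augmentation concrete, the following is a minimal sketch of time and frequency masking applied to a feature matrix. It is an illustration only, not the system's actual recipe: the function name `mask_spectrogram`, the mask widths, the use of a single mask per axis, and the choice to fill masked regions with zeros are all assumptions, since the abstract does not state the masking policy.

```python
import random


def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of `spec` with one frequency band and one time span zeroed.

    `spec` is a list of frames, each a list of per-bin feature values.
    All parameter names and default widths are hypothetical.
    """
    rng = rng or random.Random()
    out = [list(frame) for frame in spec]  # copy so the input is untouched
    n_frames = len(out)
    n_bins = len(out[0]) if out else 0

    # Frequency mask: a band of f consecutive bins starting at f0.
    f = rng.randint(0, min(max_freq_width, n_bins))
    f0 = rng.randint(0, n_bins - f)

    # Time mask: a span of t consecutive frames starting at t0.
    t = rng.randint(0, min(max_time_width, n_frames))
    t0 = rng.randint(0, n_frames - t)

    for i, frame in enumerate(out):
        for j in range(n_bins):
            if f0 <= j < f0 + f or t0 <= i < t0 + t:
                frame[j] = 0.0  # assumed fill value; other fills are possible
    return out


if __name__ == "__main__":
    features = [[1.0] * 40 for _ in range(100)]  # 100 frames x 40 bins of dummy data
    augmented = mask_spectrogram(features, rng=random.Random(0))
    print(len(augmented), len(augmented[0]))
```

Passing a seeded `random.Random` makes the masks reproducible; in training one would typically draw fresh masks for each utterance in each epoch.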