A key desideratum for inclusive and accessible speech recognition technology is robust performance on children's speech. Notably, this applies to the rapidly advancing neural-network-based end-to-end speech recognition systems. Children's speech recognition is more challenging than adult speech recognition because of larger intra- and inter-speaker variability in acoustic and linguistic characteristics. Furthermore, the lack of adequate and appropriate children's speech resources adds to the challenge of designing robust end-to-end neural architectures. This study provides a critical assessment of automatic children's speech recognition through an empirical study of contemporary state-of-the-art end-to-end speech recognition systems. Insights are provided on training data requirements, adaptation on children's data, and the effects of children's age, utterance length, different architectures and loss functions for end-to-end systems, and the role of language models on speech recognition performance.