This work aims to automatically evaluate whether a child's language development is age-appropriate. Validated speech and language tests are used for this purpose to assess auditory memory. Here, the task is to determine whether spoken nonwords have been uttered correctly. We compare different approaches, each motivated by modeling specific language structures: low-level features (FFT), speaker embeddings (ECAPA-TDNN), grapheme-motivated embeddings (wav2vec 2.0), and phonetic embeddings in the form of senones (ASR acoustic model). Each approach provides input to a VGG-like 5-layer CNN classifier. We also examine adaptation per nonword. The proposed systems were evaluated on recordings of spoken nonwords collected in different kindergartens. ECAPA-TDNN and the low-level FFT features do not explicitly model phonetic information, wav2vec 2.0 is trained on grapheme labels, and our ASR acoustic model features contain (sub-)phonetic information. We found that the more granular the phonetic modeling, the higher the achieved recognition rates. The best system, trained on ASR acoustic model features with VTLN, achieved an accuracy of 89.4% and an area under the ROC (receiver operating characteristic) curve (AUC) of 0.923. This corresponds to a relative improvement over the FFT baseline of 20.2% in accuracy and 0.309 in AUC.
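The abstract states that each feature type feeds a VGG-like 5-layer CNN classifier. The paper's exact configuration (channel widths, kernel sizes, input resolution) is not given here, so the following PyTorch sketch is illustrative only: five conv/ReLU/max-pool blocks in the VGG style, with assumed channel sizes, producing a two-class decision (correct vs. incorrect nonword) from a spectrogram-like input.

```python
import torch
import torch.nn as nn

class VGGLike5LayerCNN(nn.Module):
    """Hedged sketch of a VGG-like 5-layer CNN classifier.
    Channel sizes, kernel sizes, and input shape are assumptions,
    not the authors' published architecture."""

    def __init__(self, in_channels: int = 1, n_classes: int = 2):
        super().__init__()
        chans = [in_channels, 16, 32, 64, 128, 128]  # assumed widths
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # VGG-style block: 3x3 convolution, ReLU, 2x2 max pooling
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Flatten(),
                                  nn.Linear(chans[-1], n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time), e.g. an FFT spectrogram patch
        return self.head(self.features(x))

model = VGGLike5LayerCNN()
logits = model(torch.zeros(2, 1, 64, 64))  # dummy batch of two inputs
```

The same classifier head could sit on any of the compared feature types (FFT, ECAPA-TDNN, wav2vec 2.0, ASR senone posteriors), which is what makes the comparison in the paper controlled.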
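The reported AUC of 0.923 is the area under the ROC curve of the binary correct/incorrect decision. As a self-contained reminder of what that number means, the sketch below computes AUC via the equivalent Mann-Whitney statistic: the probability that a randomly chosen correctly-uttered nonword receives a higher score than a randomly chosen incorrect one (ties count half). This is illustrative, not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the Mann-Whitney U statistic: fraction of
    positive/negative pairs ranked correctly, ties counting 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect separator scores 1.0, a random one 0.5, which is why an absolute AUC gain of 0.309 over the FFT baseline is substantial.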