End-to-end (E2E) models, which take sequences of spectral vectors from the utterances of second-language (L2) learners as input and produce the corresponding phone-level sequences as output, have recently attracted much research attention in the development of mispronunciation detection (MD) systems. However, because labeled speech data from L2 speakers are too scarce for reliable model estimation, E2E MD models are more prone to overfitting than conventional ones built on DNN-HMM acoustic models. To alleviate this issue, we propose two modeling strategies that enhance the discrimination capability of E2E MD models by implicitly leveraging, respectively, the phonetic traits encoded in a pretrained acoustic model and the phonological traits contained in the reference transcripts of the training data. The first is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model. The second is label augmentation, which captures more phonological patterns from the transcripts of the training data. A series of experiments conducted on the L2-ARCTIC English dataset seems to confirm the efficacy of our E2E MD model when compared with several top-of-the-line E2E MD models and a classic pronunciation-scoring-based method built on a DNN-HMM acoustic model.
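The abstract does not state how a recognized phone-level sequence is turned into a detection decision. A common approach, assumed here rather than taken from the paper, is to align the recognized phones with the canonical phones of the prompt and flag every position where they disagree. The sketch below illustrates that idea in Python with a plain Levenshtein alignment; the function names `align_phones` and `detect_mispronunciations`, the gap symbol `-`, and the ARPAbet-style phone labels are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

GAP = "-"  # hypothetical symbol marking an inserted or deleted phone


def align_phones(canonical: List[str], recognized: List[str]) -> List[Tuple[str, str]]:
    """Align two phone sequences by minimum edit distance.

    Returns (canonical_phone, recognized_phone) pairs, with GAP standing in
    for a phone that is missing on one side.
    """
    n, m = len(canonical), len(recognized)
    # cost[i][j] = edit distance between canonical[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = cost[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])
            deletion = cost[i - 1][j] + 1
            insertion = cost[i][j - 1] + 1
            cost[i][j] = min(substitution, deletion, insertion)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])):
            pairs.append((canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], GAP))  # learner dropped this phone
            i -= 1
        else:
            pairs.append((GAP, recognized[j - 1]))  # learner inserted this phone
            j -= 1
    pairs.reverse()
    return pairs


def detect_mispronunciations(canonical: List[str],
                             recognized: List[str]) -> List[Tuple[str, str, bool]]:
    """Flag every aligned position where the recognized phone differs."""
    return [(c, r, c != r) for c, r in align_phones(canonical, recognized)]


if __name__ == "__main__":
    # Illustrative input: "think" read with /s/ in place of /th/.
    for row in detect_mispronunciations(["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"]):
        print(row)
```

Under this assumption, the quality of the detection depends entirely on how well the E2E model recognizes what the learner actually said, which is where the overfitting concern raised above would show up.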