Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. However, conventional CAPT methods cannot effectively use non-native utterances for supervised training because ground-truth pronunciations require expensive annotation. Moreover, certain undefined non-native phonemes cannot be correctly classified into standard phonemes. To solve these problems, we use the vector-quantized variational autoencoder (VQ-VAE) to encode speech into discrete acoustic units in a self-supervised manner. Based on these units, we propose a novel method that integrates both discriminative and generative models, detecting mispronunciations and generating the correct pronunciation at the same time. Experiments on the L2-Arctic dataset show that the detection F1 score improves by a relative 9.58% over recognition-based methods. For mispronunciation correction, the proposed method also achieves a word error rate (WER) comparable to text-to-speech (TTS) methods, with the best style preservation.
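The core VQ-VAE step the abstract relies on is mapping each continuous acoustic frame to its nearest learned codebook vector, yielding discrete unit IDs. The following is a minimal NumPy sketch of that nearest-neighbour quantization, not the authors' implementation; the codebook size, feature dimension, and function name are illustrative assumptions.

```python
import numpy as np

def vq_quantize(features, codebook):
    """Map frame-level features to discrete acoustic units.

    features: (T, D) array of T acoustic frames (hypothetical encoder output)
    codebook: (K, D) array of K learned codebook vectors
    Returns (unit IDs of shape (T,), quantized features of shape (T, D)).
    """
    # Squared Euclidean distance from every frame to every codebook entry: (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # discrete unit ID per frame
    quantized = codebook[indices]    # replace each frame by its nearest code vector
    return indices, quantized

# Toy usage with random data (sizes are illustrative)
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K=8 units, D=4 feature dims
feats = rng.normal(size=(10, 4))     # T=10 frames
ids, quantized = vq_quantize(feats, codebook)
```

In the actual model, the encoder, codebook, and decoder are trained jointly (with a straight-through gradient estimator for the non-differentiable argmin); the discrete `ids` are what the proposed detection and generation models consume.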