Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge for such end-to-end solutions is the scarcity of human-annotated phoneme transcriptions for natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model and fine-tune it on the original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic: they are produced on-the-fly by an ensemble of the online model, which makes our model robust to pseudo-label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and a 2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared to state-of-the-art MDD systems, our MDD solution produces more accurate and consistent phonetic error diagnoses. In addition, we conduct an open test on the separate UTD-4Accents dataset, where our system's recognition outputs show a strong correlation with human perception in terms of accentedness and intelligibility.
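To make the dynamic pseudo-labeling idea concrete, the following is a minimal sketch (an illustration under our own assumptions, not the paper's exact implementation): pseudo labels for unlabeled L2 speech are generated on-the-fly by averaging per-frame phoneme posteriors from an ensemble of recent checkpoints of the online model and taking the argmax. The function name and toy data are hypothetical.

```python
def ensemble_pseudo_labels(checkpoint_posteriors):
    """Produce one pseudo phoneme label per frame.

    checkpoint_posteriors: list of [T x V] posterior grids, one grid per
    checkpoint in the ensemble (T frames, V phoneme classes). The ensemble
    average smooths out single-checkpoint errors, making the resulting
    pseudo labels less noisy than those from any one checkpoint.
    """
    n_ckpt = len(checkpoint_posteriors)
    n_frames = len(checkpoint_posteriors[0])
    labels = []
    for t in range(n_frames):
        vocab = len(checkpoint_posteriors[0][t])
        # Average the posterior for each phoneme class across checkpoints.
        avg = [sum(p[t][v] for p in checkpoint_posteriors) / n_ckpt
               for v in range(vocab)]
        # The pseudo label is the class with the highest averaged posterior.
        labels.append(max(range(vocab), key=avg.__getitem__))
    return labels

# Toy example: two checkpoints, two frames, three phoneme classes.
ckpt_a = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
ckpt_b = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
print(ensemble_pseudo_labels([ckpt_a, ckpt_b]))  # -> [0, 2]
```

In this sketch the frames assigned pseudo labels would then be mixed into the fine-tuning batches alongside the human-labeled L2 samples; because the ensemble is refreshed as the online model trains, the pseudo labels improve over the course of fine-tuning rather than being fixed once, as in offline PL.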