We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. Training does not require phonetically transcribed L2 speech; only the mispronounced words need to be marked. Without phonetic transcriptions of L2 speech, the model must learn from the weak signal of word-level mispronunciation labels alone. Combined with the limited amount of mispronounced L2 speech, this makes the model prone to overfitting. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probability of mispronunciation for each word. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech, which is easily accessible and can be annotated automatically. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors, measured by AUC, by 30% on the GUT Isle Corpus of L2 Polish speakers and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
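The multi-task setup can be illustrated with a minimal sketch of a joint training objective. The abstract does not say how the two task losses are combined, so the weighted sum, the `weight` parameter, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math

EPS = 1e-12  # guards against log(0)

def word_loss(prob, label):
    """Binary cross-entropy for one word.

    prob:  predicted probability that the word is mispronounced
    label: 1 if the word was marked as mispronounced, else 0
    """
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def phoneme_loss(probs, target):
    """Cross-entropy for one phoneme prediction.

    probs:  predicted distribution over the phoneme inventory
    target: index of the reference phoneme (from transcribed L1 speech)
    """
    return -math.log(max(probs[target], EPS))

def multitask_loss(word_probs, word_labels, phone_probs, phone_targets, weight=0.5):
    """Hypothetical joint objective: mean word-level loss plus a weighted
    mean phoneme-recognition loss (the weighting scheme is an assumption)."""
    l_word = sum(word_loss(p, y) for p, y in zip(word_probs, word_labels)) / len(word_labels)
    l_phone = sum(phoneme_loss(p, t) for p, t in zip(phone_probs, phone_targets)) / len(phone_targets)
    return l_word + weight * l_phone
```

In a sketch like this, the phoneme term acts as an auxiliary signal: it can be computed on L1 speech with automatic annotations, so the shared model receives supervision beyond the sparse word-level labels.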