Despite significant advances in recent years, existing Computer-Assisted Pronunciation Training (CAPT) methods detect pronunciation errors with relatively low accuracy (a precision of 60% at 40%-80% recall). This Ph.D. work proposes novel deep learning methods for detecting pronunciation errors in non-native (L2) English speech, improving on the state-of-the-art method in the AUC (Area Under the Curve) metric by 41% in relative terms, i.e., from 0.528 to 0.749. One problem with existing CAPT methods is the scarcity of annotated mispronounced speech needed to reliably train pronunciation error detection models. Therefore, the detection of pronunciation errors is reformulated as the task of generating synthetic mispronounced speech. Intuitively, if we could mimic mispronounced speech and produce any amount of training data, detecting pronunciation errors would be more effective. Furthermore, to eliminate the need to align canonical and recognized phonemes, a novel end-to-end multi-task technique that detects pronunciation errors directly is proposed. The pronunciation error detection models have been used at Amazon to automatically detect pronunciation errors in synthetic speech, accelerating research into new speech synthesis methods. It is also demonstrated that the proposed deep learning methods are applicable to the tasks of detecting and reconstructing dysarthric speech.