Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators. Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only. Better performance has been shown when using systems that are trained specifically for the task using non-native data. Yet, such systems face the challenge that datasets labelled for this task are scarce and usually small. In this paper, we present a transfer learning-based approach that leverages a model trained for ASR, adapting it for the task of pronunciation scoring. We analyze the effect of several design choices and compare the performance with a state-of-the-art goodness of pronunciation (GOP) system. Our final system is 20% better than the GOP system on EpaDB, a database for pronunciation scoring research, for a cost function that prioritizes low rates of unnecessary corrections.