Mispronunciation detection and diagnosis (MDD) is a key component of computer-assisted pronunciation training (CAPT) systems. When assessing the pronunciation quality of constrained speech, the given transcription can play the role of a teacher. Conventional methods, such as forced alignment and extended recognition networks, make full use of this prior text to build the model or improve system performance. Recently, some end-to-end methods have attempted to incorporate the prior text into model training and have shown preliminary evidence of its effectiveness. However, previous studies mostly apply a raw attention mechanism to fuse audio representations with text representations, without accounting for possible mismatch between the text and the actual pronunciation. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objectives of phoneme recognition and MDD. We conducted experiments on two publicly available datasets (TIMIT and L2-Arctic), and our best model improved the F1 score from $57.51\%$ to $61.75\%$ over the baselines. In addition, we provide a detailed analysis to shed light on the effectiveness of the gating mechanism and contrastive learning for MDD.
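To make the idea of gating the text information concrete, the following is a minimal NumPy sketch of one possible gated audio-text fusion. It is an illustration under stated assumptions, not the architecture used in the paper: the abstract does not specify the gate's form, so the residual combination `audio + gate * context`, the function name `gated_fusion`, and the parameters `w_gate` and `b_gate` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_fusion(audio, text, w_gate, b_gate):
    """Fuse audio frames with text (phoneme) embeddings through a gate.

    audio:  (T, d) frame-level audio representations
    text:   (N, d) embeddings of the prior transcription
    w_gate: (2d, d) hypothetical gate weights
    b_gate: (d,)    hypothetical gate bias
    """
    d = audio.shape[-1]
    # Plain scaled dot-product attention: each audio frame queries the text.
    scores = audio @ text.T / np.sqrt(d)          # (T, N)
    context = softmax(scores, axis=-1) @ text     # (T, d)
    # Assumed gate: a sigmoid over the concatenated audio and text context.
    # Values near 0 suppress the text context for that frame and dimension.
    gate = sigmoid(np.concatenate([audio, context], axis=-1) @ w_gate + b_gate)
    # Assumed residual form: audio features always pass through unchanged.
    fused = audio + gate * context
    return fused, gate


rng = np.random.default_rng(0)
T, N, d = 6, 4, 8                                 # frames, phonemes, feature size
audio = rng.standard_normal((T, d))
text = rng.standard_normal((N, d))
w_gate = rng.standard_normal((2 * d, d)) * 0.1
b_gate = np.zeros(d)

fused, gate = gated_fusion(audio, text, w_gate, b_gate)
print(fused.shape, gate.shape)
```

In this sketch, when the gate closes (for example, with a strongly negative bias) the fused output falls back to the audio features alone, which is the behaviour one would want when the transcription does not match what the learner actually said.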