With the acceleration of globalization, more and more people are willing or required to learn a second language (L2). One of the major remaining challenges facing current mispronunciation detection and diagnosis (MDD) models used in computer-assisted pronunciation training (CAPT) is handling speech from L2 learners with a diverse set of accents. In this paper, we set out to mitigate the adverse effects of accent variety when building an L2 English MDD system with end-to-end (E2E) neural models. To this end, we first propose a modeling framework that infuses accent features into an E2E MDD model, thereby making the model more accent-aware. Going a step further, we design several distinct accent-aware modules that modulate the acoustic features in a fine-grained manner, so as to enhance the discriminating capability of the resulting MDD model. Extensive experiments conducted on the L2-ARCTIC benchmark dataset show the merits of our MDD model in comparison with several strong E2E baselines and the widely used pronunciation-scoring-based method.