As an indispensable ingredient of computer-assisted pronunciation training (CAPT), automatic pronunciation assessment (APA) plays a pivotal role in aiding self-directed language learners by providing multi-aspect and timely feedback. However, at least two obstacles may hinder its performance in practical use. On the one hand, most studies focus exclusively on segmental (phonetic)-level features such as goodness of pronunciation (GOP), which may cause a mismatch in feature granularity when performing suprasegmental (prosodic)-level pronunciation assessment. On the other hand, APA still suffers from a lack of large-scale labeled speech data from non-native speakers, which inevitably limits its performance. In this paper, we tackle these problems by integrating multiple prosodic and phonological features to provide multi-view, multi-granularity, and multi-aspect (3M) pronunciation modeling. Specifically, we augment GOP with prosodic and self-supervised learning (SSL) features, and develop a vowel/consonant positional embedding for more phonology-aware automatic pronunciation assessment. A series of experiments conducted on the publicly available speechocean762 dataset shows that our approach obtains significant improvements at several assessment granularities in comparison with previous work, especially on the assessment of speaking fluency and speech prosody.
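As a rough illustration of the feature fusion described above, the following minimal sketch concatenates per-phone GOP, prosodic, and SSL feature vectors, projects them to a common dimension, and adds a vowel/consonant embedding. It is not the authors' implementation: the function names, the feature dimensions, the ARPAbet vowel set, and the choice to add (rather than concatenate) the embedding are all assumptions made for illustration.

```python
import numpy as np

# Assumption: ARPAbet vowel symbols without stress markers.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def phone_class_ids(phones):
    """Map each phone to 0 (consonant) or 1 (vowel)."""
    return np.array([1 if p in VOWELS else 0 for p in phones], dtype=np.int64)


def fuse_features(gop, prosodic, ssl, phones, class_table, proj):
    """Fuse three per-phone feature views (hypothetical helper).

    gop, prosodic, ssl: arrays of shape (n_phones, d_view), one row per phone.
    class_table: (2, d_model) embedding table for consonant/vowel.
    proj: (d_gop + d_pro + d_ssl, d_model) projection matrix.
    """
    # Concatenate the views along the feature axis.
    fused = np.concatenate([gop, prosodic, ssl], axis=-1)
    # Project to the model dimension.
    hidden = fused @ proj
    # Add the vowel/consonant embedding selected by each phone's class id.
    return hidden + class_table[phone_class_ids(phones)]
```

In a trained model, `proj` and `class_table` would be learned parameters; here they are plain arrays, so the sketch only shows how the pieces fit together.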