Recent studies on pronunciation scoring have explored using phone embeddings as the reference pronunciation, but mostly implicitly: the reference phone embedding is added to or concatenated with the representation of the actual pronunciation of the target phone to form the phone-level pronunciation quality representation. In this paper, we propose to use linguistic-acoustic similarity to explicitly measure the deviation of non-native production from its native reference for pronunciation assessment. Specifically, the deviation is first estimated by the cosine similarity between the reference phone embedding and the corresponding acoustic embedding. Next, a phone-level goodness of pronunciation (GOP) pre-training stage is introduced to guide this similarity-based learning and give the two embeddings a better initialization. Finally, a transformer-based hierarchical pronunciation scorer maps the sequence of phone embeddings and acoustic embeddings, together with their similarity measures, to the final utterance-level score. Experimental results on non-native databases suggest that the proposed system significantly outperforms the baselines, in which the acoustic and phone embeddings are simply added or concatenated. A further examination shows that the phone embeddings learned by the proposed approach capture linguistic-acoustic attributes of native pronunciation and can therefore serve as the reference.
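The explicit deviation measure described above, cosine similarity between a reference phone embedding and the corresponding acoustic embedding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `phone_similarities`, the `eps` guard, and the toy 3-dimensional vectors are all assumptions made for illustration, and in the actual system both embeddings would be learned.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float],
                      eps: float = 1e-8) -> float:
    """Cosine similarity between two equal-length vectors.

    `eps` (an assumption of this sketch) guards against division by
    zero when one of the vectors has zero norm.
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def phone_similarities(phone_embs: Sequence[Sequence[float]],
                       acoustic_embs: Sequence[Sequence[float]]) -> List[float]:
    """One similarity value per phone in the utterance.

    Hypothetical helper: pairs the i-th reference phone embedding with
    the i-th acoustic embedding and scores how close they are.
    """
    if len(phone_embs) != len(acoustic_embs):
        raise ValueError("need one acoustic embedding per phone")
    return [cosine_similarity(p, a) for p, a in zip(phone_embs, acoustic_embs)]


# Toy example: two phones with made-up 3-dimensional embeddings.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
produced = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
sims = phone_similarities(reference, produced)
```

In this toy case the first produced phone points in the same direction as its reference (similarity near 1), while the second is orthogonal to it (similarity near 0). Under the approach in the abstract, such per-phone similarity values are passed to the scorer alongside the two embedding sequences rather than being used as final scores on their own.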