In this work, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) pre-trained speech models, specifically CPC, wav2vec 2.0, and HuBERT, on their ability to capture articulatory feature (AF) information and on how well that ability predicts phone recognition performance in within- and cross-language scenarios. First, frame-level AF probing tasks were implemented. Subsequently, phone-level end-to-end ASR systems for phone recognition were trained, and performance on the frame-level AF probing task was correlated with phone accuracy. Compared to conventional MFCC features, all SSL pre-trained speech representations captured more AF information and achieved better phone recognition performance both within and across languages, with HuBERT performing best. The frame-level AF probing task proved a good predictor of phone recognition performance, underlining the importance of capturing AF information in speech representations. In the within-language scenario, the SSL pre-trained models achieved a maximum relative improvement of 34.4% over MFCC on the AF probing tasks, corresponding to the lowest phone error rate (PER) of 10.2%; in the cross-language scenario, a maximum relative improvement of 26.7% corresponded to the lowest PER of 23.0%.
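To illustrate the frame-level probing setup described above, the following minimal Python sketch trains a linear AF classifier on top of a frozen wav2vec 2.0 encoder. It is a sketch under stated assumptions, not the paper's implementation: the HuggingFace checkpoint "facebook/wav2vec2-base", the AF inventory size NUM_AF_CLASSES, and the availability of frame-aligned AF labels are all assumptions introduced here for illustration.

    # Minimal sketch of a frame-level articulatory-feature (AF) probe on
    # frozen SSL representations. Checkpoint name, AF inventory size, and
    # label alignment are illustrative assumptions, not the paper's setup.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    NUM_AF_CLASSES = 24  # hypothetical size of the AF inventory

    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    encoder.eval()  # the SSL encoder stays frozen throughout probing
    for p in encoder.parameters():
        p.requires_grad_(False)

    # Linear probe: one AF decision per ~20 ms encoder frame
    probe = nn.Linear(encoder.config.hidden_size, NUM_AF_CLASSES)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    def probing_step(waveform: torch.Tensor, frame_labels: torch.Tensor) -> float:
        """One training step of the frame-level AF probe.

        waveform:     (batch, samples) raw 16 kHz audio
        frame_labels: (batch, frames) AF labels aligned to encoder frames
        """
        with torch.no_grad():  # frozen feature extraction
            feats = encoder(waveform).last_hidden_state  # (batch, frames, dim)
        logits = probe(feats)  # (batch, frames, classes)
        loss = criterion(logits.flatten(0, 1), frame_labels.flatten())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

Because only the linear probe is trained, its frame-level accuracy reflects how linearly accessible AF information is in the frozen representation, which is the quantity correlated with phone accuracy in the analysis above.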