This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between the subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task, conducted with the Libri-light and ZeroSpeech 2017 databases, showed that our approach is competitive with or superior to state-of-the-art methods. Comprehensive and systematic analyses at the phoneme and articulatory feature (AF) levels showed that our approach was better at capturing diphthong than monophthong vowel information, and that the amount of information captured differed across consonant types. Moreover, a positive correlation was found between the effectiveness of the back-end in capturing a phoneme's information and the quality of the cross-lingual phone labels assigned to that phoneme. The AF-level analysis, together with t-SNE visualizations, showed that the proposed approach captures manner and place of articulation information, as well as vowel height and backness information, better than MFCC and APC features. Taken together, the analyses showed that both stages of our approach are effective in capturing phoneme and AF information. Nevertheless, monophthong vowel information is captured less well than consonant information, which suggests that future research should focus on improving the capture of monophthong vowel information.
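For concreteness, the sketch below illustrates the kind of APC front-end named above, assuming a PyTorch implementation; the layer sizes, the prediction offset n, and the use of log-mel input frames are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal APC sketch (assumptions: 80-dim log-mel frames, 3-layer LSTM,
# L1 loss, prediction offset n=3 -- all illustrative, not the paper's
# exact setup).
import torch
import torch.nn as nn

class APC(nn.Module):
    """Autoregressive predictive coding: an RNN reads acoustic frames
    left to right and is trained to predict the frame n steps ahead."""
    def __init__(self, input_dim=80, hidden_dim=512, num_layers=3):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, input_dim)  # maps states to a predicted frame

    def forward(self, x):
        h, _ = self.rnn(x)      # (batch, time, hidden_dim)
        return self.proj(h), h  # frame predictions and hidden features

def apc_loss(model, x, n=3):
    """L1 loss between predicted frames and the true frames n steps ahead."""
    pred, _ = model(x[:, :-n, :])
    return torch.abs(pred - x[:, n:, :]).mean()

# Training step on a toy batch of 200 frames of 80-dim features.
x = torch.randn(4, 200, 80)
model = APC()
loss = apc_loss(model, x)
loss.backward()
```

After self-supervised training, the hidden states h (rather than the frame predictions) would serve as the learned representations passed to the cross-lingual DNN back-end for the knowledge-transfer stage.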