Unsupervised representation learning for speech audio has achieved impressive performance on speech recognition tasks, particularly when annotated speech is limited. However, the unsupervised paradigm must be carefully designed, and little is known about what properties these representations acquire. There is no guarantee that the model learns representations that capture information valuable for recognition. Moreover, the ability of the learned representations to adapt to other domains remains to be assessed. In this work, we explore learning domain-invariant representations via a direct mapping of speech representations to their corresponding high-level linguistic information. Results show that the learned latents not only capture the articulatory features of each phoneme but also improve adaptation ability, outperforming the baseline by a large margin on accented benchmarks.