Recent progress in self-training, self-supervised pretraining, and unsupervised learning has enabled well-performing speech recognition systems without any labeled data. However, in many cases labeled data is available for related languages but is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work that introduced task-specific architectures and used only part of a monolingually pretrained model.
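To make the mapping step concrete, the sketch below illustrates one way such a phoneme mapping can work: each phoneme is described by a vector of binary articulatory features, and every training-language phoneme is assigned to the nearest phoneme in the target language's inventory by Hamming distance. This is a minimal, hypothetical illustration, not the paper's implementation; the tiny feature table and helper names are invented for the example, whereas the paper draws on full articulatory feature inventories.

```python
from typing import Dict, List

# Hypothetical feature vectors, for illustration only:
# (voiced, nasal, bilabial, alveolar, plosive, fricative)
FEATURES: Dict[str, List[int]] = {
    "p": [0, 0, 1, 0, 1, 0],
    "b": [1, 0, 1, 0, 1, 0],
    "t": [0, 0, 0, 1, 1, 0],
    "d": [1, 0, 0, 1, 1, 0],
    "m": [1, 1, 1, 0, 0, 0],
    "n": [1, 1, 0, 1, 0, 0],
    "s": [0, 0, 0, 1, 0, 1],
    "z": [1, 0, 0, 1, 0, 1],
}

def hamming(a: List[int], b: List[int]) -> int:
    """Number of articulatory features on which two phonemes disagree."""
    return sum(x != y for x, y in zip(a, b))

def map_phoneme(src: str, target_inventory: List[str]) -> str:
    """Map a training-language phoneme to the closest target-language phoneme."""
    return min(target_inventory, key=lambda t: hamming(FEATURES[src], FEATURES[t]))

# Example: the target inventory lacks voiced obstruents, so /b/, /d/, /z/
# fall back to their voiceless counterparts.
target = ["p", "t", "m", "n", "s"]
for src in ["b", "d", "z"]:
    print(src, "->", map_phoneme(src, target))  # b -> p, d -> t, z -> s
```

With a mapping like this, phoneme labels from the labeled training languages can be rewritten into the target language's inventory, so the fine-tuned model's outputs align with target-language phonemes without any target-language transcriptions.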