We present a method for transferring pre-trained self-supervised learning (SSL) speech representations to multiple languages. Unannotated speech is abundant, so learning self-supervised representations from raw audio and then fine-tuning on small annotated datasets is a promising direction for building speech recognition systems. SSL models typically pre-train on raw audio and are then fine-tuned on a small fraction of annotated data; such models have produced state-of-the-art results for automatic speech recognition (ASR). However, these models are very expensive to pre-train. We use an existing wav2vec 2.0 model and tackle the problem of learning representations for a new language while reusing the knowledge already in the model. Crucially, we do so without catastrophic forgetting of the existing language representation. We use adapter modules to speed up pre-training on the new language. Our model decreases pre-training time by 32% when learning a new language task, and learns this new audio-language representation without forgetting the previous one. We evaluate these language representations by applying them to automatic speech recognition.
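The abstract does not spell out the adapter design, so the following is a minimal sketch of the standard residual bottleneck adapter (down-projection, non-linearity, up-projection with a skip connection) that adapter-based methods typically insert into each frozen transformer block; the class name and bottleneck dimension here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter (illustrative sketch).

    Only these small modules would be trained for the new language,
    while the pre-trained wav2vec 2.0 backbone stays frozen.
    """

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's representation,
        # so the existing language is not catastrophically forgotten.
        return x + self.up(self.act(self.down(self.norm(x))))


if __name__ == "__main__":
    adapter = BottleneckAdapter(hidden_dim=768)  # 768 = wav2vec 2.0 BASE hidden size
    frames = torch.randn(1, 100, 768)            # (batch, time, hidden) encoder features
    print(adapter(frames).shape)                 # torch.Size([1, 100, 768])
```

In this style of adaptation, the backbone parameters are frozen and only the adapter weights receive gradients, which is what makes pre-training on a new language substantially cheaper than training the full model.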