We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL), applied to automatic speech recognition. Unannotated speech is abundant, so creating self-supervised representations from raw audio and fine-tuning on a small annotated dataset is a promising direction for building speech recognition systems. Wav2vec models perform SSL on raw audio in a pretraining phase and are then fine-tuned on a small fraction of annotated data. SSL models have produced state-of-the-art results for ASR; however, they are very expensive to pretrain. We tackle the problem of continually learning new language representations from audio without forgetting previously learned language representations. We use ideas from continual learning to transfer knowledge from a previous task in order to speed up pretraining on a new language. Our continual-wav2vec2 model can decrease pretraining time by 32% when learning a new language task, and learns the new audio-language representation without forgetting previous language representations.