This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.