XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.


Related content

Representation learning


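Representation learning uses training data to learn vector representations, which overcomes the limitations of hand-crafted features. It is usually divided into two broad categories: unsupervised and supervised representation learning. Most unsupervised methods use the latent variables of an autoencoder (for example, a denoising or sparse autoencoder) as the representation. The more recent variational autoencoders tolerate noise and outliers better. However, inferring the latent structure of given data is almost intractable, and several approximate inference strategies exist. In addition, some unsupervised methods aim to approximate a specific similarity measure. One proposed unsupervised, similarity-preserving framework uses matrix factorization to preserve pairwise dynamic time warping (DTW) similarities. By learning DTW-preserving shapelets, the Euclidean distance in the transformed space approximates the true DTW distance of the original data. Supervised methods can exploit label information to better capture the semantic structure of the data. Siamese networks and triplet networks are two popular models; their objective is to maximize the distance between classes and minimize the distance within each class.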
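The triplet objective mentioned above can be illustrated with a minimal sketch. It assumes a margin-based hinge loss over Euclidean distances, which is one common formulation and not necessarily the one used by any specific paper listed here. The function names, the margin value, and the toy vectors are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed formulation).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is; otherwise it is positive.
    """
    d_pos = euclidean(anchor, positive)  # same-class distance, to be minimized
    d_neg = euclidean(anchor, negative)  # different-class distance, to be maximized
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings (illustrative values only).
anchor = [0.0, 0.0]
same_class = [0.0, 1.0]    # distance 1 from the anchor
other_class = [3.0, 4.0]   # distance 5 from the anchor

# Well-separated triplet: max(0, 1 - 5 + 1) = 0
loss_easy = triplet_loss(anchor, same_class, other_class)

# Swapped roles, so the constraint is violated: max(0, 5 - 1 + 1) = 5
loss_hard = triplet_loss(anchor, other_class, same_class)
```

In a real triplet network the three vectors would be the outputs of a shared encoder, and the loss would be averaged over many sampled triplets and minimized by gradient descent.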
