Self-Supervised Learning (SSL) using huge unlabeled datasets has been successfully explored for image and natural language processing. Recent works have also investigated SSL from speech. They were notably successful in improving performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient speech systems, their evaluation was mostly made on ASR and using multiple and heterogeneous experimental settings (most of them for English). This questions the objective comparison of SSL approaches and the evaluation of their impact on building speech systems. In this paper, we propose LeBenchmark: a reproducible framework for assessing SSL from speech. It includes not only ASR (high- and low-resource) tasks but also spoken language understanding, speech translation, and emotion recognition. We also focus on speech technologies in a language other than English: French. SSL models of different sizes are trained from carefully sourced and documented datasets. Experiments show that SSL is beneficial for most but not all tasks, which confirms the need for exhaustive and reliable benchmarks to evaluate its real impact. LeBenchmark is shared with the scientific community for reproducible research in SSL from speech.