Self-supervised learning (SSL) of speech representations has received much attention over the last few years, but most work has focused on languages and domains with an abundance of unlabeled data. However, for many languages even unlabeled data is scarce, which limits the effectiveness of SSL. In this work, we focus on the problem of applying SSL to domains with limited available data by leveraging data augmentation for Wav2Vec 2.0 pretraining. Furthermore, we propose improvements to each component of the model, which yield a combined relative word error rate (WER) improvement of up to 13% over Wav2Vec 2.0 on Librispeech test-clean/other.