Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST~2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks. On four GLUE tasks and text-normalization, we observe evidence of capacity limitations and interference between the two modalities, leading to degraded performance compared to an equivalent text-only model, while still being competitive with BERT. Through extensive empirical analysis we also demonstrate the importance of the choice of objective function for speech pre-training, and the beneficial effect of adding additional supervised signals on the quality of the learned representations.
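One compact way to summarize the multi-task pre-training setup described above is as a weighted combination of the per-modality and alignment losses; the following formulation is a sketch based on the abstract, and the scalar weights $\lambda$ are an assumption rather than something stated in the text:
\begin{equation}
\mathcal{L} \;=\; \mathcal{L}_{\text{BERT}} \;+\; \mathcal{L}_{\text{w2v-BERT}} \;+\; \lambda_{\text{TLM}}\,\mathcal{L}_{\text{TLM}} \;+\; \lambda_{\text{STM}}\,\mathcal{L}_{\text{STM}},
\end{equation}
where the first two terms are computed on unlabeled text and unlabeled speech respectively, and the last two alignment terms are computed on supervised paired speech--text data.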