Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained encoder-decoder models, however, do not share the language modeling component (decoder) for the same language, which is likely to be inefficient for distant target languages. We introduce speech-to-text translation (ST) as an auxiliary task to incorporate additional knowledge of the target language and to enable transfer from that target language. Specifically, we first translate high-resource ASR transcripts into a target low-resource language, with which an ST model is trained. Both ST and target ASR share the same attention-based encoder-decoder architecture and vocabulary. The former task then provides a fully pre-trained model for the latter, bringing up to 24.6% word error rate (WER) reduction over the baseline (direct transfer from high-resource ASR). We show that training ST with human translations is not necessary: ST trained with machine translation (MT) pseudo-labels brings consistent gains. When transferred to target ASR, it can even outperform ST trained with human labels while leveraging only 500K MT examples. Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to 8.9% WER reduction over direct transfer.
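To make the transfer recipe concrete, below is a minimal sketch (not the authors' code) of the ST-enhanced pipeline: pre-train an encoder-decoder on ST pairs (high-resource speech, MT pseudo-labels in the target language), then initialize the target ASR model from the entire ST checkpoint, which is possible because both tasks share the architecture and vocabulary. The toy model, feature dimensions, and random data are illustrative assumptions; the actual system uses an attention-based encoder-decoder.

```python
# Sketch of ST-enhanced transfer for low-resource ASR (assumptions: 80-dim speech
# features, token IDs from a shared target-language vocabulary, random toy data).
import torch
import torch.nn as nn

VOCAB_SIZE, FEAT_DIM, HID = 1000, 80, 256  # shared vocabulary for ST and target ASR

class Seq2Seq(nn.Module):
    """Toy encoder-decoder standing in for the attention-based model in the paper."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(FEAT_DIM, HID, batch_first=True)
        self.embed = nn.Embedding(VOCAB_SIZE, HID)
        self.decoder = nn.LSTM(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB_SIZE)

    def forward(self, feats, tokens):
        _, (h, c) = self.encoder(feats)          # encode speech features
        dec_in = self.embed(tokens)              # teacher-forced target tokens
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.out(dec_out)

def train_step(model, feats, tokens, optimizer):
    logits = model(feats, tokens[:, :-1])        # predict the next token
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# 1) Pre-train on ST: high-resource speech paired with MT pseudo-labels
#    (or human translations) in the target low-resource language.
st_model = Seq2Seq()
opt = torch.optim.Adam(st_model.parameters(), lr=1e-3)
st_feats = torch.randn(8, 50, FEAT_DIM)
st_text = torch.randint(0, VOCAB_SIZE, (8, 12))
train_step(st_model, st_feats, st_text, opt)

# 2) Transfer: initialize target ASR with the full ST model (encoder and decoder),
#    rather than the encoder only as in direct transfer from high-resource ASR.
asr_model = Seq2Seq()
asr_model.load_state_dict(st_model.state_dict())

# 3) Fine-tune on the low-resource target ASR data.
opt = torch.optim.Adam(asr_model.parameters(), lr=1e-4)
asr_feats = torch.randn(4, 50, FEAT_DIM)
asr_text = torch.randint(0, VOCAB_SIZE, (4, 12))
train_step(asr_model, asr_feats, asr_text, opt)
```

The key design point illustrated here is that the shared vocabulary and architecture let the ST task hand over a fully pre-trained decoder in the target language, which direct cross-lingual ASR transfer cannot provide.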