One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model. This yields gains of 8.3 and 5.7 BLEU over a strong semi-supervised baseline on the MuST-C English-French and English-German datasets, reaching state-of-the-art performance. We investigate the effect of pseudo-label quality. Our approach proves more effective than simply pre-training the encoder on the speech recognition task. Finally, we demonstrate the effectiveness of self-training by generating pseudo-labels directly with an end-to-end model instead of a cascade model.
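To make the recipe concrete, the sketch below illustrates the pseudo-labeling step in generic Python, assuming a teacher translator interface; the names `Translator`, `generate_pseudo_labels`, and `build_training_set` are illustrative placeholders, not the paper's actual code.

```python
"""Minimal sketch of pseudo-labeling for self-training, under the
assumptions stated above (not the paper's implementation)."""

from typing import List, Protocol, Tuple


class Translator(Protocol):
    """Any teacher that maps an audio utterance to a target-language string,
    e.g. a cascade ASR+MT pipeline or an end-to-end ST model."""

    def translate(self, audio: List[float]) -> str: ...


def generate_pseudo_labels(
    teacher: Translator, unlabeled_audio: List[List[float]]
) -> List[Tuple[List[float], str]]:
    """Label each unlabeled utterance with the teacher's translation."""
    return [(audio, teacher.translate(audio)) for audio in unlabeled_audio]


def build_training_set(
    labeled: List[Tuple[List[float], str]],
    teacher: Translator,
    unlabeled_audio: List[List[float]],
) -> List[Tuple[List[float], str]]:
    """Combine gold (audio, translation) pairs with pseudo-labeled ones;
    the end-to-end student model is then trained on the union."""
    return labeled + generate_pseudo_labels(teacher, unlabeled_audio)
```

In the self-training variant, the teacher passed to `generate_pseudo_labels` is itself an end-to-end model rather than a cascade.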