This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task, which translates from English audio to German, Chinese, and Japanese. The YiTrans system is built on large-scale pre-trained encoder-decoder models. More specifically, we first design a multi-stage pre-training strategy to build a multi-modality model with a large amount of labeled and unlabeled data. We then fine-tune the corresponding components of the model for the downstream speech translation tasks. Moreover, we make various efforts to improve performance, such as data filtering, data augmentation, speech segmentation, and model ensemble. Experimental results show that our YiTrans system obtains a significant improvement over the strong baseline on all three translation directions, and it achieves a +5.2 BLEU improvement over last year's best end-to-end system on tst2021 English-German. Our final submissions rank first among end-to-end systems on English-German and English-Chinese in terms of the automatic evaluation metric. We make our code and models publicly available.