This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation. For the Tunisian Arabic-English dataset (low-resource and dialect tracks), we build an end-to-end model as our joint primary submission and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR. Our results show that, in our settings, pipeline approaches are still very competitive, and that with the use of transfer learning they can outperform end-to-end models for speech translation (ST). For the Tamasheq-French dataset (low-resource track), our primary submission leverages intermediate representations from a wav2vec 2.0 model trained on 234 hours of Tamasheq audio, while our contrastive model uses a French phonetic transcription of the Tamasheq audio as input to a Conformer speech translation architecture jointly trained on automatic speech recognition, ST, and machine translation losses. Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models. Results also illustrate that even approximate phonetic transcriptions can improve ST scores.
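To make the feature-extraction step concrete, the following is a minimal sketch, not the authors' released code, of pulling intermediate representations out of a wav2vec 2.0 model with the HuggingFace transformers API. The checkpoint name and the choice of layer 8 are stand-in assumptions for illustration; the paper's model is instead one pretrained on 234 hours of Tamasheq audio.

```python
# Hypothetical sketch: extract intermediate wav2vec 2.0 representations
# to feed a downstream speech translation encoder.
import torch
from transformers import Wav2Vec2Model

# Stand-in checkpoint; the paper uses a model pretrained on Tamasheq audio.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)  # dummy input: 1 second of 16 kHz audio
with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)

# outputs.hidden_states is a tuple over layers (index 0 is the CNN
# feature-projection output); an intermediate layer, rather than the
# final one, can serve as input features for the ST model.
features = outputs.hidden_states[8]  # e.g. layer 8: (batch, frames, 768)
print(features.shape)
```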