This paper presents a unified end-to-end framework for both streaming and non-streaming speech translation. While the training recipes for non-streaming speech translation are mature, recipes for streaming speech translation have yet to be established. In this work, we focus on developing a unified model (UniST) that supports both streaming and non-streaming ST from the perspective of fundamental components, including the training objective, attention mechanism, and decoding policy. Experiments on the most popular speech-to-text translation benchmark, MuST-C, show that UniST achieves significant improvements for non-streaming ST and a better-learned trade-off between BLEU score and latency metrics for streaming ST, compared with end-to-end baselines and cascaded models. We will make our code and evaluation tools publicly available.
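The abstract names the decoding policy as one of the unified components but does not describe it. As a purely illustrative sketch, the snippet below implements wait-k, a common decoding policy for simultaneous translation that reads an initial lag of k source tokens and then alternates between writing one target token and reading one more source token. The function names (`wait_k_decode`, `translate_step`) and the dummy model are assumptions for illustration, not UniST's actual method.

```python
# Hypothetical sketch of a wait-k streaming decoding policy.
# The paper does not specify UniST's policy; all names here are illustrative.

from typing import Callable, List

def wait_k_decode(
    source_stream: List[str],
    translate_step: Callable[[List[str], List[str]], str],
    k: int = 3,
    eos: str = "</s>",
    max_len: int = 100,
) -> List[str]:
    """READ the first k source tokens, then alternate WRITE/READ:
    emit one target token, reveal one more source token, repeat."""
    target: List[str] = []
    # Initial READ phase: consume the first k source tokens.
    read = min(k, len(source_stream))
    while len(target) < max_len:
        # WRITE: predict the next target token from the visible source prefix.
        token = translate_step(source_stream[:read], target)
        target.append(token)
        if token == eos:
            break
        # READ: reveal one more source token, if any remain.
        if read < len(source_stream):
            read += 1
    return target

if __name__ == "__main__":
    # Dummy stand-in for a translation model: copies the next visible
    # source token (demo only; a real model would score a vocabulary).
    def dummy_step(src_prefix: List[str], tgt_prefix: List[str]) -> str:
        if len(tgt_prefix) >= len(src_prefix):
            return "</s>"
        return src_prefix[len(tgt_prefix)]

    print(wait_k_decode("the cat sat on the mat".split(), dummy_step, k=2))
```

The value of k directly controls the quality-latency trade-off the abstract refers to: a larger k lets the model see more source context before committing to a target token (higher BLEU, higher latency), while a smaller k emits output sooner.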