Applications for simultaneous speech translation at events such as conferences or meetings must balance quality against lag when displaying translated text in order to deliver a good user experience. One common approach to building online spoken language translation systems is to leverage models built for offline speech translation. Building on a technique for adapting end-to-end monolingual models, we investigate how well multilingual models and different architectures (end-to-end and cascade) perform online speech translation. On the multilingual TEDx corpus, we show that the approach generalizes to different architectures: we see similar gains in latency reduction (40% relative) across languages and architectures. However, the end-to-end architecture leads to smaller translation quality losses after adapting to the online model. Furthermore, the approach even scales to zero-shot directions.