For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare methods for improving models' robustness to segmentation errors, as well as different segmentation strategies, in both offline and online settings, and we report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
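As a rough illustration of what fixed-window audio segmentation can look like, the sketch below cuts a continuous recording into consecutive windows of equal duration, ignoring pauses and sentence boundaries. The function name, the 16 kHz sample rate and the 20-second window are illustrative assumptions, not values taken from the paper.

```python
def fixed_window_segments(num_samples, sample_rate=16000, window_seconds=20.0):
    """Split a recording into consecutive fixed-length windows.

    Returns a list of (start, end) sample offsets, with end exclusive.
    The final window may be shorter than the others.
    NOTE: the default sample rate and window length are assumptions
    chosen for illustration only.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    window = int(round(window_seconds * sample_rate))
    if window <= 0:
        raise ValueError("window must cover at least one sample")
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


if __name__ == "__main__":
    # A 50-second recording at 16 kHz, cut into 20-second windows.
    for start, end in fixed_window_segments(50 * 16000):
        print(f"{start / 16000:.1f}s to {end / 16000:.1f}s")
```

Each window would then be passed to the translation model as if it were one utterance. Because the boundaries fall at arbitrary points in the speech, this is the kind of segmentation under which robustness to segmentation errors matters.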