The mismatch between the audio segmentation of training data and that of the audio encountered at run-time is a major problem in direct speech translation. Indeed, while systems are usually trained on manually segmented corpora, in real-world use cases they are often fed continuous audio that requires automatic (and sub-optimal) segmentation. After comparing existing techniques (VAD-based, fixed-length, and hybrid segmentation methods), in this paper we propose enhanced hybrid solutions that produce better results without sacrificing latency. Through experiments on different domains and language pairs, we show that our methods outperform all the other techniques, closing by at least 30% the gap between the traditional VAD-based approach and optimal manual segmentation.
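To make the compared families of techniques concrete, the following is a minimal sketch of the generic hybrid idea referenced above: split the audio stream at VAD-detected pauses, but fall back to a fixed-length cut when no sufficiently long pause occurs. All function names, thresholds, and frame parameters here are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of hybrid segmentation: prefer cutting at pauses
# found by a VAD, but enforce a maximum segment length so that long
# pause-free stretches are still cut (the fixed-length fallback).
# Thresholds and the per-frame speech-probability input are assumptions.

def hybrid_segment(speech_probs, frame_ms=10, max_len_ms=20000,
                   pause_thresh=0.5, min_pause_frames=30):
    """Return (start_frame, end_frame) segments over a stream of
    per-frame speech probabilities produced by some VAD."""
    segments = []
    start = 0
    pause_run = 0                      # consecutive non-speech frames
    max_frames = max_len_ms // frame_ms
    for i, p in enumerate(speech_probs):
        pause_run = pause_run + 1 if p < pause_thresh else 0
        seg_len = i - start + 1
        # Cut at a long-enough pause (if the segment is not all pause),
        # or force a cut once the maximum length is reached.
        if (pause_run >= min_pause_frames and seg_len > pause_run) \
                or seg_len >= max_frames:
            segments.append((start, i + 1))
            start = i + 1
            pause_run = 0
    if start < len(speech_probs):      # flush the trailing segment
        segments.append((start, len(speech_probs)))
    return segments
```

On a stream with a clear pause, this cuts at the pause; on uninterrupted speech, it degrades to fixed-length segmentation, which is precisely the hybrid behavior.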