Speech segmentation, which splits long speech into short segments, is essential for speech translation (ST). Popular voice activity detection (VAD) tools such as WebRTC VAD generally rely on pause-based segmentation. Unfortunately, pauses in speech do not necessarily coincide with sentence boundaries, and consecutive sentences may be separated by only a very short pause that VAD struggles to detect. In this study, we propose a speech segmentation method that uses a binary classification model trained on a segmented bilingual speech corpus. We also propose a hybrid method that combines VAD with this segmentation method. Experimental results show that the proposed method is better suited to both cascade and end-to-end ST systems than conventional segmentation methods, and that the hybrid approach further improves translation performance.
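To make the contrast between pause-based and hybrid segmentation concrete, the following is a minimal sketch, not the method evaluated in the study. The abstract does not say how VAD and the classifier are combined, so the rule used here (take the union of both sets of cut points, then drop cuts that would create very short segments) is an assumption. The per-frame speech flags and boundary probabilities are assumed to be given as plain lists; in practice they would come from a VAD tool and a trained binary classifier. All function names, parameters, and default values are hypothetical.

```python
from typing import List, Tuple


def vad_boundaries(is_speech: List[bool], min_pause: int) -> List[int]:
    """Pause-based cuts: the midpoint of every interior run of at least
    `min_pause` consecutive non-speech frames."""
    boundaries = []
    run_start = None
    for i, speech in enumerate(is_speech):
        if not speech:
            if run_start is None:
                run_start = i  # a pause begins here
        elif run_start is not None:
            # A pause just ended; ignore leading silence (run_start == 0).
            if i - run_start >= min_pause and run_start > 0:
                boundaries.append((run_start + i) // 2)
            run_start = None
    return boundaries


def classifier_boundaries(boundary_prob: List[float], threshold: float) -> List[int]:
    """Classifier-based cuts: frames whose (assumed) boundary probability
    reaches `threshold`."""
    return [i for i, p in enumerate(boundary_prob) if p >= threshold]


def hybrid_segments(
    is_speech: List[bool],
    boundary_prob: List[float],
    min_pause: int = 3,      # hypothetical default
    threshold: float = 0.5,  # hypothetical default
    min_len: int = 2,        # hypothetical default
) -> List[Tuple[int, int]]:
    """Assumed hybrid rule: union of VAD cuts and classifier cuts, skipping
    any cut that would leave a segment shorter than `min_len` frames.
    Returns half-open (start, end) frame ranges."""
    n = len(is_speech)
    cuts = sorted(
        set(vad_boundaries(is_speech, min_pause))
        | set(classifier_boundaries(boundary_prob, threshold))
    )
    segments = []
    start = 0
    for cut in cuts:
        if cut - start >= min_len:
            segments.append((start, cut))
            start = cut
    if n - start > 0:
        segments.append((start, n))  # remainder after the last cut
    return segments


# Toy input: one clear pause (frames 5-8) and one sentence boundary at
# frame 12 that has no pause, so only the classifier can find it.
is_speech = [True] * 5 + [False] * 4 + [True] * 6
boundary_prob = [0.0] * 15
boundary_prob[12] = 0.9

print(vad_boundaries(is_speech, min_pause=3))
print(hybrid_segments(is_speech, boundary_prob))
```

In this toy input, the pause-based step finds only the cut inside the silent run, while the hybrid rule also cuts at frame 12, where two sentences are joined without a detectable pause. That is the failure case for pause-based segmentation that the abstract describes.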