Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant with specific display guidelines. As in speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text also has to be annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments show that SubST systems trained on manual and on automatic segmentations achieve similar performance, demonstrating the effectiveness of our approach.
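To make the notion of subtitle-break annotation concrete, the minimal Python sketch below shows one naive way a translation could be annotated with break markers. It is a hypothetical illustration, not the multimodal segmenter described above: the <eol> (line break) and <eob> (subtitle-block break) token names, the 42-character line limit, and the two-lines-per-subtitle rule are assumptions drawn from common subtitling conventions rather than from this abstract.

```python
# Hypothetical illustration of subtitle-break annotation (not the paper's
# multimodal segmenter): greedily insert <eol> (line break) and <eob>
# (subtitle-block break) markers so that no line exceeds a character limit.
# The 42-character limit and 2-lines-per-subtitle rule are assumed from
# common subtitling guidelines.

MAX_CHARS_PER_LINE = 42
MAX_LINES_PER_BLOCK = 2


def annotate_with_breaks(translation: str) -> str:
    """Return the translation with <eol>/<eob> markers inserted greedily."""
    words = translation.split()
    lines, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(candidate) > MAX_CHARS_PER_LINE:
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))

    # Join lines: <eol> inside a subtitle block, <eob> between blocks.
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i == len(lines) - 1:
            break
        out.append("<eol>" if (i + 1) % MAX_LINES_PER_BLOCK else "<eob>")
    return " ".join(out)


if __name__ == "__main__":
    print(annotate_with_breaks(
        "Speech translation for subtitling converts audio into "
        "well-formed subtitles by inserting break markers."
    ))
```

A rule-based heuristic like this ignores the audio signal entirely; the approach summarized above instead learns where to place such breaks from audio and text jointly, which is what allows existing ST corpora to be converted into SubST resources automatically.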