Speech translation (ST) has recently received growing interest for generating subtitles without the need for an intermediate source-language transcription and timing (i.e., captions). However, the joint generation of source captions and target subtitles not only brings potential output-quality advantages when the two decoding processes inform each other, but is also often required in multilingual scenarios. In this work, we focus on ST models that generate captions and subtitles which are consistent in structure and lexical content. We further introduce new metrics for evaluating subtitling consistency. Our findings show that joint decoding improves both performance and the consistency between the generated captions and subtitles, while still allowing sufficient flexibility to produce subtitles conforming to language-specific needs and norms.