Automatic subtitling is the task of automatically translating the speech of an audiovisual product into short pieces of timed text, i.e., subtitles and their corresponding timestamps. The generated subtitles must conform to multiple spatial and temporal requirements (length, reading speed) while being synchronised with the speech and segmented in a way that facilitates comprehension. Given this considerable complexity, automatic subtitling has so far been addressed through a pipeline of components that separately handle transcription, translation, segmentation into subtitles, and timestamp prediction. In this paper, we propose the first direct automatic subtitling model, which generates target-language subtitles and their timestamps from the source speech in a single solution. Comparisons with state-of-the-art cascaded models trained on both in-domain and out-of-domain data show that our system produces high-quality subtitles while also being competitive in terms of conformity, with all the advantages of maintaining a single model.
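To make the conformity constraints mentioned above concrete, the sketch below shows how a generated subtitle block could be checked against typical spatial and temporal limits. This is a minimal illustration, not part of the proposed model: the `Subtitle` structure and the `is_conformal` helper are hypothetical, and the thresholds of 42 characters per line and 21 characters per second are assumptions reflecting common subtitling guidelines rather than values stated in this abstract.

```python
# Illustrative sketch (not the paper's method): checking whether a subtitle
# block conforms to typical spatial (length) and temporal (reading speed)
# constraints. Thresholds are assumed common guidelines, not paper values.
from dataclasses import dataclass

@dataclass
class Subtitle:
    lines: list[str]   # subtitle text, usually at most two lines
    start: float       # display start timestamp, in seconds
    end: float         # display end timestamp, in seconds

def is_conformal(sub: Subtitle, max_cpl: int = 42, max_cps: float = 21.0) -> bool:
    """Return True if the subtitle respects length and reading-speed limits."""
    # Spatial constraint: no line may exceed the characters-per-line limit.
    if any(len(line) > max_cpl for line in sub.lines):
        return False
    # Temporal constraint: reading speed (characters per second of display
    # time) must not exceed the limit, so viewers can keep up.
    duration = sub.end - sub.start
    n_chars = sum(len(line) for line in sub.lines)
    return duration > 0 and n_chars / duration <= max_cps

# Example: a two-line subtitle displayed for 3 seconds.
sub = Subtitle(lines=["Automatic subtitling generates", "timed text from speech."],
               start=12.0, end=15.0)
print(is_conformal(sub))  # True: both lines <= 42 chars, ~17.7 chars/sec
```

Under this formulation, a cascaded pipeline must satisfy these constraints through a dedicated segmentation stage, whereas a direct model has to learn to emit subtitle breaks and timestamps that respect them jointly with translation.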