TV subtitles are a rich source of transcriptions of many types of speech, ranging from read speech in news reports to conversational and spontaneous speech in talk shows and soap operas. However, subtitles are not verbatim (i.e., exact) transcriptions of speech, so they cannot be used directly to improve an Automatic Speech Recognition (ASR) model. We propose a multitask dual-decoder Transformer model that jointly performs ASR and automatic subtitling. The ASR decoder (possibly pre-trained) predicts the verbatim transcription and the subtitle decoder generates the subtitle, while both share the encoder. The two decoders can be independent or connected. The model is trained to perform both tasks jointly, and can effectively exploit subtitle data. By incorporating the additional subtitle decoder, we show improvements on regular ASR as well as on spontaneous and conversational ASR. The method does not require preprocessing (aligning, filtering, pseudo-labeling, ...) of the subtitles.
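To make the architecture concrete, below is a minimal sketch of a dual-decoder Transformer with a shared encoder, written with standard PyTorch modules. All names, hyperparameters, the filterbank front-end, and the loss weighting are illustrative assumptions rather than the paper's exact setup, and only the variant with independent decoders is shown (the "connected" variant would add interaction, e.g. attention, between the two decoders).

```python
# Illustrative sketch only: one shared encoder feeding two Transformer
# decoders, one for verbatim ASR output and one for subtitles. All
# dimensions, names, and the loss weight below are assumed, not the
# paper's actual configuration.
import torch
import torch.nn as nn

class DualDecoderTransformer(nn.Module):
    def __init__(self, vocab_size, feat_dim=80, d_model=256, nhead=4, layers=4):
        super().__init__()
        self.front_end = nn.Linear(feat_dim, d_model)   # project speech features
        self.embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.encoder = nn.TransformerEncoder(           # shared speech encoder
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.asr_decoder = nn.TransformerDecoder(       # verbatim-transcript decoder
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.sub_decoder = nn.TransformerDecoder(       # subtitle decoder
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.asr_out = nn.Linear(d_model, vocab_size)
        self.sub_out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, asr_in, sub_in):
        memory = self.encoder(self.front_end(feats))    # one encoder, two decoders
        asr_mask = nn.Transformer.generate_square_subsequent_mask(asr_in.size(1))
        sub_mask = nn.Transformer.generate_square_subsequent_mask(sub_in.size(1))
        asr_h = self.asr_decoder(self.embed(asr_in), memory, tgt_mask=asr_mask)
        sub_h = self.sub_decoder(self.embed(sub_in), memory, tgt_mask=sub_mask)
        return self.asr_out(asr_h), self.sub_out(sub_h)

# Joint training step on one (speech, verbatim transcript, subtitle) batch;
# the 0.5 subtitle-loss weight is a placeholder hyperparameter.
model = DualDecoderTransformer(vocab_size=1000)
feats = torch.randn(2, 120, 80)                        # (batch, frames, filterbanks)
asr_in, asr_tgt = torch.randint(0, 1000, (2, 20)), torch.randint(0, 1000, (2, 20))
sub_in, sub_tgt = torch.randint(0, 1000, (2, 15)), torch.randint(0, 1000, (2, 15))
asr_logits, sub_logits = model(feats, asr_in, sub_in)
ce = nn.CrossEntropyLoss()
loss = ce(asr_logits.reshape(-1, 1000), asr_tgt.reshape(-1)) \
     + 0.5 * ce(sub_logits.reshape(-1, 1000), sub_tgt.reshape(-1))
loss.backward()
```

In this setup the subtitle loss back-propagates through the shared encoder, which is the mechanism by which non-verbatim subtitle data can still improve the encoder representations used by the ASR decoder.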