Speech-to-text translation (ST), which directly translates source-language speech into target-language text, has attracted intensive attention recently. However, combining speech recognition and machine translation in a single model places a heavy burden on the direct cross-modal, cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach to speech-to-text translation. The key idea is to generate the source transcript and the target translation with a single decoder. This benefits model training, since additional large-scale parallel text corpora can be fully exploited to enhance speech translation training. Our method is verified on three mainstream datasets: the Augmented LibriSpeech English-French dataset, the IWSLT2018 English-German dataset, and the TED English-Chinese dataset. Experiments show that COSTT outperforms or is on par with previous state-of-the-art methods on all three datasets. Our code is released at \url{https://github.com/dqqcasia/st}.
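To make the consecutive-decoding idea concrete, here is a minimal sketch, not the released implementation, of a single Transformer decoder trained on the concatenation of transcript tokens, a separator, and translation tokens. The special-token ids, model dimensions, 80-dim filterbank input, and the build_target helper are illustrative assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical special-token ids (not taken from the paper or its repo).
BOS, SEP = 1, 2  # <bos> starts decoding; <sep> marks the transcript/translation boundary

class ConsecutiveDecoderModel(nn.Module):
    """Minimal Transformer sketch: one decoder emits transcript, <sep>, translation."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.speech_proj = nn.Linear(80, d_model)   # project 80-dim filterbank frames
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech, prev_tokens):
        # speech: (batch, frames, 80); prev_tokens: (batch, len) shifted target.
        # Positional encodings are omitted for brevity.
        memory = self.transformer.encoder(self.speech_proj(speech))
        mask = self.transformer.generate_square_subsequent_mask(prev_tokens.size(1))
        dec = self.transformer.decoder(self.embed(prev_tokens), memory, tgt_mask=mask)
        return self.out(dec)

def build_target(transcript_ids, translation_ids):
    """Concatenate transcript, <sep>, and translation into one decoder target,
    so both outputs are generated consecutively by a single decoder."""
    return [BOS] + transcript_ids + [SEP] + translation_ids

# Usage sketch: one teacher-forced training step on toy data.
model = ConsecutiveDecoderModel(vocab_size=100)
speech = torch.randn(2, 50, 80)                  # (batch, frames, fbank dim)
tgt = torch.tensor([build_target([5, 6, 7], [8, 9]),
                    build_target([5, 7, 6], [9, 8])])
logits = model(speech, tgt[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
loss.backward()
```

Because the transcript precedes the translation in one autoregressive sequence, the translation tokens can attend to the already-generated transcript, which is also what allows transcript-translation text pairs from large parallel corpora to be used to train the decoder.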