End-to-end speech translation models have become a new research trend due to their potential to reduce error propagation. However, these models still suffer from data scarcity. How to effectively leverage unlabeled data or parallel corpora from machine translation is promising but remains an open problem. In this paper, we propose the Cross Speech-Text Network (XSTNet), an end-to-end model for speech-to-text translation. XSTNet takes both speech and text as input and outputs both the transcription and the translation text. The model benefits from three key design aspects: a self-supervised pre-trained sub-network as the audio encoder, a multi-task training objective that exploits additional parallel bilingual text, and a progressive training procedure. We evaluate XSTNet and baselines on the MuST-C En-X and LibriSpeech En-Fr datasets. In particular, XSTNet achieves state-of-the-art results on all language directions with an average BLEU of 28.8, outperforming the previous best method by 3.2 BLEU. Code, models, cases, and more detailed analysis are available at https://github.com/ReneeYe/XSTNet.