The amount of labeled data available to train models for speech tasks is limited for most languages, and this scarcity is exacerbated for speech translation, which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to building speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognition, machine translation, and speech synthesis, either in a pipeline approach or to generate pseudo-labels for training end-to-end speech translation models. Furthermore, we present an unsupervised domain adaptation technique for pre-trained speech models which improves the performance of downstream unsupervised speech recognition, especially in low-resource settings. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art by 3.2 BLEU on the Libri-Trans benchmark; on CoVoST 2, our best systems outperform the best supervised end-to-end models (without pre-training) from only two years ago by an average of 5.0 BLEU over five X-En directions. We also report competitive results on the MuST-C and CVSS benchmarks.
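The abstract describes two ways of combining the unsupervised components: chaining them as a cascade at inference time, or using that cascade to produce pseudo-labels for training an end-to-end speech translation model. The sketch below is only meant to illustrate this distinction; the class names, methods, and data handling are hypothetical placeholders assumed for illustration, not the authors' released implementation.

```python
# Minimal sketch of the two uses of the unsupervised components described above.
# All wrappers (UnsupervisedASR, UnsupervisedMT) are hypothetical placeholders.

from typing import List, Tuple


class UnsupervisedASR:
    """Placeholder for a speech recognizer trained without transcripts."""
    def transcribe(self, audio: List[float]) -> str:
        raise NotImplementedError


class UnsupervisedMT:
    """Placeholder for a translation model trained on monolingual text only."""
    def translate(self, text: str) -> str:
        raise NotImplementedError


def cascade_translate(audio: List[float], asr: UnsupervisedASR,
                      mt: UnsupervisedMT) -> str:
    # Pipeline approach: source speech -> source-language text -> target-language text.
    return mt.translate(asr.transcribe(audio))


def make_pseudo_labels(unlabeled_audio: List[List[float]], asr: UnsupervisedASR,
                       mt: UnsupervisedMT) -> List[Tuple[List[float], str]]:
    # Pseudo-labeling: pair each unlabeled utterance with a synthetic translation,
    # then train an end-to-end speech translation model on these (audio, text) pairs.
    return [(a, cascade_translate(a, asr, mt)) for a in unlabeled_audio]
```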