Existing speech-to-speech translation models fall into two camps: textless models trained with hundreds of hours of parallel speech data or unsupervised models that leverage text as an intermediate step. Both approaches limit building speech-to-speech translation models for a wide range of languages, as they exclude languages that are primarily spoken and language pairs that lack large-scale parallel speech data. We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems that only need dozens of hours of parallel speech data. We reformulate S2ST as a unit-to-unit seq2seq translation task, and start by pretraining a model on large-scale monolingual speech data. Then, we finetune it with a small amount of parallel speech data ($20-60$ hours). Lastly, we improve model performance through an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1-2 points of our supervised topline.
翻译:暂无翻译