End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems approaches that of conventional cascade S2ST when trained on comparable datasets. In practice, however, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is improved by +13.6 BLEU (or +113% relative), compared to the previous state of the art trained without additional data. The improvements on low-resource languages are even more significant (+398% relative on average). Our comparative studies suggest future research directions for S2ST and speech representation learning.
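For concreteness, the absolute and relative gains quoted above are consistent with each other: a +13.6 BLEU gain corresponding to +113% relative implies a baseline of roughly 12.0 BLEU. A minimal Python sketch of this arithmetic follows; the ~12.0 BLEU baseline is inferred from the quoted numbers, not stated in the abstract.

```python
def relative_improvement(baseline_bleu: float, gain_bleu: float) -> float:
    """Relative improvement (%) implied by an absolute BLEU gain over a baseline."""
    return 100.0 * gain_bleu / baseline_bleu

# +13.6 BLEU over a ~12.0 BLEU baseline (baseline inferred, not stated)
# gives roughly +113% relative, matching the figures reported above.
print(f"{relative_improvement(12.0, 13.6):.0f}%")  # -> 113%
```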