We study speech-to-speech translation (S2ST), which translates speech from one language into another, and focus on building systems to support languages without standard text writing systems. Using English-Taiwanese Hokkien as a case study, we present an end-to-end solution spanning training data collection, modeling choices, and benchmark dataset release. First, we describe our efforts to create human-annotated data, automatically mine data from large unlabeled speech datasets, and adopt pseudo-labeling to produce weakly supervised data. On the modeling side, we take advantage of recent advances in using self-supervised discrete representations as prediction targets in S2ST, and we show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field. The demo can be found at https://huggingface.co/spaces/facebook/Hokkien_Translation .