In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in a target language. The backbone of SM2 is Transformer Transducer, which has a high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained on weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has a truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs that are not seen during training.