We present the Multilingual TEDx corpus, built to support automatic speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-source code that enables extension to new talks and languages as they become available. Our corpus creation methodology can be applied to more languages than previous work and creates multi-way parallel evaluation sets. We provide baselines in multiple ASR and ST settings, including multilingual models to improve translation performance for low-resource language pairs.