We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation, providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X and 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.