In this paper, we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. Both datasets were made available for the IWSLT 2022 low-resource speech translation track and consist of radio recordings from the daily news broadcasts of Studio Kalangou (Niger) and Studio Tamani (Mali). We share (i) a large amount of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq, and Zarma, and (ii) a smaller parallel corpus of audio recordings (17 hours) in Tamasheq, with utterance-level translations in French. All of this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.