Code-mixing has attracted broad interest in Natural Language Processing (NLP); however, little attention has been given to this phenomenon in the Speech Translation (ST) task. This can be solely attributed to the lack of labelled code-mixed ST data. We therefore introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages. It is multi-domain, covers ten language families, and contains 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language. Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching when quoting from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadavani is the first multilingual code-mixed ST dataset available in the ST literature. The data can also be used for a code-mixed machine translation task. The full dataset can be accessed at https://github.com/frozentoad9/CMST.