We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic-English speech translation corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, into monolingual Egyptian Arabic and into monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and the corpus publicly available. We also report results for baseline systems on machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research on the code-switching phenomenon from a linguistic perspective, and that can be used to train and evaluate NLP systems.