With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR) that handles language and dialectal variation in spoken content. Recent studies have shown its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR system using the self-attention-based Conformer architecture. We train the system on Arabic (Ar), English (En), and French (Fr). We evaluate its performance on: (i) monolingual (Ar, En, and Fr); (ii) multi-dialectal (Modern Standard Arabic, along with dialectal variants such as Egyptian and Moroccan); and (iii) code-switching -- cross-lingual (Ar-En/Fr) and dialectal (MSA-Egyptian) -- test cases, and compare against current state-of-the-art systems. Furthermore, we investigate the influence of different embedding/character representations, including character vs. word-piece tokenization and shared vs. distinct input symbols per language. Our findings demonstrate the strength of such a model, which outperforms state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR systems.