This paper presents the design and development of a multi-dialect automatic speech recognition system for Arabic. Deep neural networks are becoming an effective tool for solving sequential data problems, particularly when the system is trained end to end. Arabic speech recognition is a complex task because of the existence of multiple dialects, the scarcity of large corpora, and missing vocalization. The first contribution of this work is therefore the development of a large multi-dialectal corpus with fully or at least partially vocalized transcriptions. Because this open-source corpus was gathered from multiple sources, its transcriptions contain non-standard Arabic characters, which are normalized by defining a common character set. The second contribution is the development of a framework for training an acoustic model that achieves state-of-the-art performance. The network architecture comprises a combination of convolutional and recurrent layers. Spectrogram features of the audio are extracted in the frequency-versus-time domain and fed into the network. The output frames produced by the recurrent layers are further trained to align the audio features with their corresponding transcription sequences. Sequence alignment is performed with a beam search decoder coupled with a 4-gram language model. The proposed system achieves a 14% error rate, outperforming previous systems.
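For concreteness, the following is a minimal sketch of how a convolutional-plus-recurrent acoustic model over spectrogram features might be structured. PyTorch, the specific layer shapes, the GRU choice, and the CTC-style per-frame character outputs are illustrative assumptions and are not taken from the paper.

```python
# A minimal sketch (not the authors' released code) of a convolutional + recurrent
# acoustic model over spectrogram inputs producing per-frame character scores
# suitable for CTC training and beam-search decoding. PyTorch, the layer sizes,
# and the CTC-style output are assumptions made for illustration.
import torch
import torch.nn as nn


class ConvRNNAcousticModel(nn.Module):
    def __init__(self, n_freq=161, rnn_hidden=512, vocab_size=40):
        super().__init__()
        # Convolutional front end over the (frequency x time) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        # Frequency dimension after the two strided convolutions above.
        f = (n_freq + 2 * 20 - 41) // 2 + 1
        f = (f + 2 * 10 - 21) // 2 + 1
        # Recurrent layers model temporal context over the convolutional features.
        self.rnn = nn.GRU(32 * f, rnn_hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        # Per-frame character scores (vocabulary includes a CTC blank symbol).
        self.fc = nn.Linear(2 * rnn_hidden, vocab_size)

    def forward(self, spectrograms):                    # (batch, 1, freq, time)
        x = self.conv(spectrograms)                     # (batch, 32, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time', features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)           # per-frame log-probabilities


# Example: a batch of 4 spectrograms with 161 frequency bins and 300 time frames.
model = ConvRNNAcousticModel()
log_probs = model(torch.randn(4, 1, 161, 300))          # (4, time', vocab_size)
```

In the full system described above, such per-frame log-probabilities would then be decoded with a beam search that incorporates the 4-gram language model; that decoder is not shown in this sketch.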