Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to those of human transcribers, prompting researchers to debate whether machines have reached human performance. Previous work focused on the English language and on modular hidden Markov model-deep neural network (HMM-DNN) systems. In this paper, we perform a comprehensive benchmark of end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition (HSR) on the Arabic language and its dialects. For HSR, we evaluate the performance of linguists and of lay native speakers on a new dataset collected as part of this study. For ASR, the end-to-end system achieves word error rates (WER) of 12.5%, 27.5%, and 33.8%, a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively. Our results suggest that human performance on Arabic is still considerably better than machine performance, with an absolute WER gap of 3.6% on average.
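Because every figure above is a word error rate, a minimal sketch of the standard WER definition may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the scoring in the study itself may apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, with no
    normalization of diacritics, punctuation, or spelling variants.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

An absolute WER gap, as reported above, is the plain difference between two such rates (machine WER minus human WER), not a ratio of the two.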