Recent advances in automatic speech recognition (ASR) have yielded accuracy comparable to that of human transcribers, prompting researchers to debate whether machines have reached human performance. Previous work focused on the English language and on modular hidden Markov model-deep neural network (HMM-DNN) systems. In this paper, we comprehensively benchmark end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition (HSR) on the Arabic language and its dialects. For HSR, we evaluate the performance of linguists and of lay native speakers on a new dataset collected as part of this study. For ASR, the end-to-end system achieves word error rates (WER) of 12.5%, 27.5%, and 33.8% on the MGB2, MGB3, and MGB5 challenges, respectively, a new performance milestone for each. Our results suggest that human performance on Arabic is still considerably better than machine performance, with an absolute WER gap of 3.5% on average.
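For readers unfamiliar with the metric, the following is a minimal sketch of how WER and an absolute WER gap are conventionally computed. It is not the scoring pipeline used in this study: the function names `word_error_rate` and `absolute_wer_gap` and the toy sentences are assumptions introduced for illustration, and the sketch omits any text normalization that a real evaluation might apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration only; it splits on whitespace
    and performs no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_gap(machine_wer: float, human_wer: float) -> float:
    """Absolute (not relative) difference between two WER values."""
    return machine_wer - human_wer


# Toy example: one substitution in a four-word reference.
toy_wer = word_error_rate("a b c d", "a x c d")  # 0.25
```

An absolute gap subtracts the two rates directly, so a machine WER of 25% against a human WER of 10% would be reported as a 15% absolute gap rather than as a relative change.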