The increasing reliability of automatic speech recognition has led to its proliferation in everyday use. However, for research purposes, it is often unclear which model one should choose for a task, particularly when both speed and accuracy are required. In this paper, we systematically evaluate six speech recognizers on English test data using metrics including word error rate, latency, and the number of updates to already recognized words, and we propose and compare two methods for streaming audio into recognizers for incremental recognition. We further propose Revokes per Second as a new metric for evaluating incremental recognition and demonstrate that it provides insights into overall model performance. We find that, in general, local recognizers are faster and require fewer updates than cloud-based recognizers. Finally, we find Meta's Wav2Vec model to be the fastest and Mozilla's DeepSpeech model to be the most stable in its predictions.
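To make the proposed metric concrete, the sketch below shows one plausible way to compute a Revokes-per-Second score from a recognizer's sequence of incremental (partial) hypotheses. This is a minimal illustration under the assumption that a "revoke" is counted whenever a later partial hypothesis changes or retracts a word that an earlier hypothesis had already emitted; the function name and the exact counting rule are illustrative and may differ from the paper's definition.

```python
# Hypothetical sketch of a Revokes-per-Second style metric (assumed
# definition, not necessarily the paper's): count how often a word that an
# earlier partial hypothesis emitted is later changed or removed, then
# normalize by the audio duration.

def revokes_per_second(incremental_hypotheses, audio_duration_s):
    """incremental_hypotheses: partial transcripts (strings) in emission order.
    audio_duration_s: length of the corresponding audio in seconds."""
    revokes = 0
    prev_words = []
    for hyp in incremental_hypotheses:
        words = hyp.split()
        overlap = min(len(prev_words), len(words))
        # A previously emitted word was replaced at the same position.
        revokes += sum(1 for i in range(overlap) if prev_words[i] != words[i])
        # Previously emitted words that were retracted outright.
        revokes += max(0, len(prev_words) - len(words))
        prev_words = words
    return revokes / audio_duration_s if audio_duration_s > 0 else 0.0


# Example: three partial results over 2 seconds of audio; "can" is revised
# to "cat" once, giving 1 revoke / 2 s = 0.5.
partials = ["the can", "the cat sat", "the cat sat down"]
print(revokes_per_second(partials, 2.0))
```

Under this assumed counting rule, a recognizer that only ever appends new words scores 0, while one that frequently rewrites its committed prefix scores higher, which matches the abstract's use of the metric as a measure of prediction stability.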