The "Switchboard benchmark" is a very well-known test set in automatic speech recognition (ASR) research, on which record-setting performance has been reported by systems claiming human-level transcription accuracy. This work highlights lesser-known practical considerations of this evaluation, demonstrating major improvements in word error rate (WER) obtained by correcting the reference transcriptions and deviating from the official scoring methodology. Under this more detailed and reproducible scheme, even commercial ASR systems can score below 5% WER, and the established record for a research system is lowered to 2.3%. An alternative metric of transcript precision is proposed, which does not penalize deletions and appears to be more discriminating between human and machine performance. While commercial ASR systems still fall below this threshold, a research system is shown to clearly surpass the accuracy of commercial human speech recognition. This work also explores using standardized scoring tools to compute oracle WER by selecting the best among a list of alternatives. A phrase-alternatives representation is compared to utterance-level N-best lists and word-level data structures; using dense lattices and adding out-of-vocabulary words, this achieves an oracle WER of 0.18%.
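To make the two metrics concrete, the sketch below computes standard WER (substitutions + deletions + insertions, divided by reference length) via dynamic-programming alignment, alongside an illustrative precision-style rate that ignores deletions. Note that the exact definition of the proposed precision metric is not given here, so the deletion-free variant shown (substitutions + insertions over hypothesis length) is an assumption for illustration only.

```python
def align_counts(ref_words, hyp_words):
    """Align hypothesis to reference by edit distance; return (S, D, I)."""
    R, H = len(ref_words), len(hyp_words)
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace to count substitutions, deletions, insertions.
    S = D = I = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] \
                and ref_words[i - 1] == hyp_words[j - 1]:
            i, j = i - 1, j - 1          # correct word
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            S += 1; i, j = i - 1, j - 1  # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1; i = i - 1            # deletion (ref word missed)
        else:
            I += 1; j = j - 1            # insertion (extra hyp word)
    return S, D, I

def wer(ref, hyp):
    """Standard WER: (S + D + I) / reference length."""
    r, h = ref.split(), hyp.split()
    S, D, I = align_counts(r, h)
    return (S + D + I) / len(r)

def precision_error_rate(ref, hyp):
    """Illustrative deletion-free rate (assumed form): (S + I) / hyp length."""
    r, h = ref.split(), hyp.split()
    S, D, I = align_counts(r, h)
    return (S + I) / len(h)
```

For example, with reference "the cat sat" and hypothesis "the cat", the missing word counts as one deletion, so WER is 1/3 while the deletion-free rate is 0, showing how the two metrics diverge on truncated output.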