Many studies have examined the shortcomings of word error rate (WER) as an evaluation metric for automatic speech recognition (ASR) systems, particularly when ASR output is used for spoken language understanding tasks such as intent recognition and dialogue systems. In this paper, we propose Hybrid-SD ($\text{H}_{\text{SD}}$), a new hybrid evaluation metric for ASR systems that takes into account both semantic correctness and error rate. To generate sentence dissimilarity (SD) scores, we built a fast and lightweight SNanoBERT model using distillation techniques. Our experiments show that the SNanoBERT model is 25.9x smaller and 38.8x faster than SRoBERTa while achieving comparable results on well-known benchmarks, making it suitable for deployment alongside ASR models on edge devices. We also show that $\text{H}_{\text{SD}}$ correlates more strongly than WER with downstream tasks such as intent recognition and named-entity recognition (NER).
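To make the two ingredients of such a hybrid metric concrete, the following is a minimal sketch, not the paper's definition of $\text{H}_{\text{SD}}$. The WER function follows the standard word-level edit-distance definition. Everything else is an assumption made for illustration: `toy_dissimilarity` uses bag-of-words cosine distance as a stand-in for SNanoBERT sentence embeddings, and `hybrid_score` with its weight `alpha` is a hypothetical convex combination, since the abstract does not state how the two terms are combined.

```python
from collections import Counter
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def toy_dissimilarity(reference: str, hypothesis: str) -> float:
    """ASSUMPTION: stand-in for the SD score, 1 - cosine similarity of
    bag-of-words counts. The paper uses SNanoBERT embeddings instead."""
    a, b = Counter(reference.split()), Counter(hypothesis.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def hybrid_score(reference: str, hypothesis: str, alpha: float = 0.5) -> float:
    """HYPOTHETICAL combination (not the paper's formula): a convex mix of
    WER, clipped to [0, 1], and the dissimilarity score."""
    wer = min(word_error_rate(reference, hypothesis), 1.0)
    return alpha * wer + (1.0 - alpha) * toy_dissimilarity(reference, hypothesis)
```

For the reference "turn on the light" and the hypothesis "turn off the light", `word_error_rate` returns 0.25: one substitution out of four reference words. A surface metric scores this the same as any other single-word substitution even though the intent is inverted, which is the kind of gap a semantic term is meant to expose. The bag-of-words stand-in used here cannot capture that difference; a sentence-embedding model would be needed.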