Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing. While error detection systems often take advantage of statistical language archetypes captured by LMs, the pretrained knowledge can at times hinder error detection performance. For instance, the presence of speech disfluencies might confuse the post-processing system into tagging disfluent but accurate transcriptions as ASR errors. Such confusion occurs because both the error detection and disfluency detection tasks attempt to identify tokens at statistically unlikely positions. This paper proposes a scheme to improve existing LM-based ASR error detection systems, both in terms of detection scores and in resilience to such distracting auxiliary tasks. Our approach adopts the popular mixup method in text feature space and can be utilized with any black-box ASR output. To demonstrate the effectiveness of our method, we conduct post-processing experiments with both traditional and end-to-end ASR systems (for both English and Korean) on 5 different speech corpora. We find that our method both improves ASR error detection F1 scores and reduces the number of correctly transcribed disfluencies wrongly detected as ASR errors. Finally, we suggest methods to utilize the resulting LMs directly in semi-supervised ASR training.
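For reference, mixup (Zhang et al., 2018) trains on convex combinations of pairs of examples and their labels, with the mixing coefficient drawn from a Beta distribution. Below is a minimal sketch of how such interpolation could look in text feature space for token-level error detection; the function name `mixup_features`, the tensor shapes, and the value of `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def mixup_features(h_a, h_b, y_a, y_b, alpha=0.2):
    """Mix two batches of LM token features and their error labels.

    h_a, h_b: (batch, seq_len, dim) hidden features from the LM
    y_a, y_b: (batch, seq_len, num_classes) soft/one-hot token labels
    (shapes and alpha here are illustrative assumptions)
    """
    # Draw the interpolation coefficient lambda ~ Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Convex combination of the features and of the labels.
    h_mix = lam * h_a + (1.0 - lam) * h_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return h_mix, y_mix
```

Training the token classifier on mixed pairs in addition to the original examples encourages smoother decision boundaries between "error" and "non-error" tokens, which is plausibly the property that makes the detector less likely to flag correctly transcribed disfluencies.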