Confidence estimation is an often-requested feature in applications such as medical transcription, where errors can affect patient care and a confidence estimate can alert medical professionals to verify potential recognition errors. In this paper, we present a lightweight neural confidence model tailored for Automatic Speech Recognition (ASR) systems that use Recurrent Neural Network Transducers (RNN-T). Compared to existing approaches, our model utilizes: (a) the time information associated with recognized words, which reduces the computational complexity, and (b) a simple and elegant trick for mapping between sub-word and word sequences. The mapping addresses the non-unique tokenization and token deletion problems while amplifying differences between confusable words. Through extensive empirical evaluations on two different long-form test sets, we demonstrate that the model achieves a Normalized Cross Entropy (NCE) of 0.4 and an Expected Calibration Error (ECE) of 0.05. It is robust across different ASR configurations, including target types (graphemes vs. morphemes), traffic conditions (streaming vs. non-streaming), and encoder types. We further discuss the importance of choosing evaluation metrics that reflect practical applications, and highlight the need for further work on improving the Area Under the Curve (AUC) for Negative Precision Rate (NPV) and True Negative Rate (TNR).
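For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how NCE and ECE are commonly computed from word-level confidence scores. It follows the widely used textbook definitions, not necessarily the exact evaluation code behind the reported numbers; the function names, the `eps` clamp, and the choice of 10 equal-width bins are assumptions made for illustration.

```python
import math


def normalized_cross_entropy(confidences, correct, eps=1e-12):
    """NCE = (H_base - H_conf) / H_base, using the common definition.

    confidences: predicted probability that each word is correct.
    correct: 1 if the word was recognized correctly, else 0.
    eps: illustrative clamp to avoid log(0); an assumption of this sketch.
    """
    n = len(confidences)
    n_correct = sum(correct)
    p_c = n_correct / n
    if p_c in (0.0, 1.0):
        raise ValueError("NCE is undefined when all words share one label")
    # Entropy of the baseline that always outputs the average accuracy.
    h_base = -(n_correct * math.log2(p_c)
               + (n - n_correct) * math.log2(1.0 - p_c))
    # Cross entropy of the confidence scores against the labels.
    h_conf = 0.0
    for c, y in zip(confidences, correct):
        c = min(max(c, eps), 1.0 - eps)
        h_conf -= math.log2(c) if y else math.log2(1.0 - c)
    return (h_base - h_conf) / h_base


def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, y in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a score of 1.0 goes in the last bin.
        idx = min(int(c * num_bins), num_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

Under these definitions, higher NCE is better: it approaches 1 for near-perfect confidences and is 0 for a model that always outputs the average word accuracy. Lower ECE is better, with 0 indicating perfect calibration within each bin.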