Despite the improved performance of the latest Automatic Speech Recognition (ASR) systems, transcription errors remain unavoidable. These errors can have a considerable impact in critical domains such as healthcare, where ASR is used to assist with clinical documentation. Detecting ASR errors is therefore a critical first step in preventing further error propagation to downstream applications. To this end, we propose a novel end-to-end approach for ASR error detection using audio-transcript entailment. To the best of our knowledge, we are the first to frame this problem as an end-to-end entailment task between an audio segment and its corresponding transcript segment. Our intuition is that a bidirectional entailment should hold between the audio and the transcript when there is no recognition error, and fail to hold when there is one. The proposed model uses an acoustic encoder and a linguistic encoder to model the speech and the transcript, respectively. The encoded representations of the two modalities are fused to predict the entailment. Since our experiments use doctor-patient conversations, we place particular emphasis on medical terms. Our proposed model achieves a classification error rate (CER) of 26.2% on all transcription errors and 23% on medical errors specifically, improving upon a strong baseline by 12% and 15.4%, respectively.
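To make the two-encoder design concrete, here is a minimal PyTorch sketch of an audio-transcript entailment classifier of the kind the abstract describes. It is an illustration under stated assumptions, not the paper's implementation: the encoders are placeholder Transformer stacks, mean pooling and concatenation stand in for whatever fusion the model actually uses, and all names and dimensions (`audio_feat_dim`, `d_model`, etc.) are hypothetical.

```python
import torch
import torch.nn as nn


class AudioTranscriptEntailment(nn.Module):
    """Sketch: acoustic + linguistic encoders fused into a binary
    entailment head (entailed = correct transcript segment,
    not entailed = recognition error). Assumed architecture, not
    the paper's exact model."""

    def __init__(self, audio_feat_dim=80, vocab_size=30000, d_model=256):
        super().__init__()
        # Acoustic encoder: project filterbank-style frames, then encode.
        self.audio_proj = nn.Linear(audio_feat_dim, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Linguistic encoder: embed transcript tokens, then encode.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Fusion + classification head over the concatenated modalities.
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 2),
        )

    def forward(self, audio_feats, token_ids):
        # Mean-pool each modality to a fixed vector (an assumption;
        # cross-attention fusion would be another common choice).
        a = self.audio_encoder(self.audio_proj(audio_feats)).mean(dim=1)
        t = self.text_encoder(self.token_emb(token_ids)).mean(dim=1)
        return self.classifier(torch.cat([a, t], dim=-1))


# Toy usage: batch of 2 segments, 100 audio frames, 12 transcript tokens.
model = AudioTranscriptEntailment()
logits = model(torch.randn(2, 100, 80), torch.randint(0, 30000, (2, 12)))
print(logits.shape)  # torch.Size([2, 2])
```

Training such a model would require segment-level labels marking whether each transcript segment faithfully renders its audio, which is what frames error detection as entailment rather than word-level tagging.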