This paper presents a novel optimization framework for automatic speech recognition (ASR) that aims to reduce hallucinations produced by an ASR model. The framework trains the ASR model to maximize the expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the score is computed by a separately trained estimator. Experiments on the AMI meeting corpus and the VoxPopuli corpus show that an ASR model trained with the proposed framework generates hypotheses with significantly higher consistency scores against the ground-truth transcriptions, while keeping word error rates close to those of cross-entropy-trained ASR models. Training with the proposed framework also improves speech summarization quality, as measured by the factual consistency of meeting conversation summaries generated by a large language model.
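To make the training objective concrete, the following is a minimal sketch of an expected consistency score computed over an N-best list. It rests on assumptions not stated in the abstract: the expectation is approximated with model probabilities renormalized over the N-best hypotheses (as is common in minimum-risk style training), and `toy_score` is a hypothetical word-overlap stand-in for the separately trained estimator. The names `expected_consistency` and `toy_score` are illustrative only.

```python
import math
from typing import Callable, Sequence


def expected_consistency(
    log_probs: Sequence[float],
    hypotheses: Sequence[str],
    reference: str,
    score_fn: Callable[[str, str], float],
) -> float:
    """Expected score over an N-best list (assumed approximation).

    Model log-probabilities are renormalized over the list so that the
    hypothesis posteriors sum to one.
    """
    # Softmax over the N-best log-probabilities; subtract the max for stability.
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    # Probability-weighted average of the per-hypothesis scores.
    return sum(p * score_fn(h, reference) for p, h in zip(posteriors, hypotheses))


def toy_score(hypothesis: str, reference: str) -> float:
    """Hypothetical stand-in for the trained estimator.

    Returns the fraction of reference words that also occur in the hypothesis.
    """
    ref_words = reference.split()
    hyp_words = set(hypothesis.split())
    return sum(w in hyp_words for w in ref_words) / len(ref_words)


reference = "the meeting starts at nine"
hypotheses = ["the meeting starts at nine", "the meeting starts at noon"]
log_probs = [math.log(0.5), math.log(0.5)]

value = expected_consistency(log_probs, hypotheses, reference, toy_score)
loss = -value  # a training loop would minimize the negative expected score
```

Under this sketch, shifting probability mass toward the hypothesis that agrees with the reference raises the expected score, which is the behavior the objective is meant to reward. In practice the posteriors would come from a differentiable ASR model and the score from the trained estimator rather than from word overlap.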