Hallucination Benchmark for Speech Foundation Models

Hallucinations in automatic speech recognition (ASR) systems are fluent, coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). Like conventional decoding errors, they can compromise the usability of transcriptions for downstream applications, but hallucinations can be more detrimental because they preserve syntactically and semantically plausible structure. This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. Conventional evaluation relies primarily on error-based metrics, which fail to distinguish between phonetic inaccuracies and hallucinations. Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. To this end, we introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic. We define targeted metrics within each category to produce interpretable profiles of model behavior. Through evaluation across various architectures and speech domains, we find that SHALLOW metrics correlate strongly with word error rate (WER) when recognition quality is high (i.e., low WER), but that this correlation weakens substantially as WER increases. SHALLOW therefore captures fine-grained error patterns that WER fails to distinguish under degraded and challenging conditions. Our framework supports specific diagnosis of model weaknesses and provides feedback for model improvement beyond what aggregate error rates can offer.
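To make the limitation of error-based metrics concrete, the following is a minimal sketch of WER computed as word-level edit distance divided by the number of reference words. It is not part of SHALLOW and does not implement any of its metrics; the function name `word_error_rate` and the three example sentences are hypothetical and chosen purely for illustration. The point is that a phonetically close but garbled transcript and a fluent, unrelated one can receive the same score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (assumptions, not taken from any dataset).
reference = "take two tablets after each meal"
phonetic_miss = "tak too tablet aftr eech meel"       # acoustically close, misspelled
hallucinated = "thank you for watching this video"    # fluent but unrelated

wer_phonetic = word_error_rate(reference, phonetic_miss)
wer_hallucinated = word_error_rate(reference, hallucinated)
print(wer_phonetic, wer_hallucinated)
```

Both hypotheses share no words with the reference, so each needs six substitutions and both score a WER of 1.0, even though only the second is a hallucination in the sense defined above. This is the kind of case where a single aggregate error rate cannot tell the two failure modes apart.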
