We address quality assessment for neural-network-based automatic speech recognition (ASR) by providing explanations that increase our understanding of the system and ultimately help build trust in it. Compared with simple classification labels, explaining transcriptions is more challenging: judging their correctness is not straightforward, and variable-length transcription sequences are not handled by existing interpretable machine learning models. We provide an explanation for an ASR transcription as a subset of audio frames that is both a minimal and a sufficient cause of the transcription. To do this, we adapt two existing explainable AI (XAI) techniques from image classification: Statistical Fault Localisation (SFL) and Causal analysis. Additionally, we use an adapted version of Local Interpretable Model-Agnostic Explanations (LIME) for ASR as a baseline in our experiments. We evaluate the quality of the explanations generated by the proposed techniques on three different ASR systems (Google API, the baseline Sphinx model, and Deepspeech) and 100 audio samples from the Commonvoice dataset.
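The notion of an explanation as a minimal, sufficient subset of audio frames can be sketched as follows. This is an illustrative toy, not the paper's actual algorithm: `transcribe` is a stand-in for a real ASR system, frames are integer indices, and the greedy backward-elimination strategy is an assumption for demonstration purposes.

```python
def transcribe(frames):
    # Toy ASR stand-in: in this mock, only frames 2 and 5 determine the output.
    # A real system would take masked audio and return a transcription string.
    return "yes" if 2 in frames and 5 in frames else "no"

def is_sufficient(subset, target):
    # A subset is sufficient if keeping only those frames (masking all
    # others) still yields the target transcription.
    return transcribe(subset) == target

def minimal_sufficient(frames, target):
    # Greedy backward elimination (hypothetical strategy): try dropping
    # each frame; commit the drop if the remainder is still sufficient.
    kept = set(frames)
    for f in sorted(frames):
        trial = kept - {f}
        if is_sufficient(trial, target):
            kept = trial
    return kept

all_frames = set(range(8))
target = transcribe(all_frames)           # "yes"
explanation = minimal_sufficient(all_frames, target)
print(sorted(explanation))                # the frames that cause "yes"
```

With the toy transcriber above, the procedure isolates exactly the frames the output depends on; with a real ASR model, sufficiency checks would involve re-running inference on masked audio, which is where SFL- and causality-based scoring become valuable for keeping the search tractable.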