Automatic speech recognition (ASR) systems are becoming ever better at mimicking human speech processing. How they work, however, remains largely obscured by the complex structure of the deep neural networks (DNNs) they are built on. In this paper, we show how so-called attribution methods, which we import from image recognition and suitably adapt to handle audio data, can help to clarify the workings of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as the detection of adversarial examples.
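As a rough illustration of the simplest of these techniques, a vanilla saliency map attributes the output to the input by taking the gradient of an output score with respect to the input features. The sketch below assumes a generic PyTorch-style acoustic model mapping a spectrogram-like (time, frequency) tensor to per-frame character logits; the model interface and tensor shapes are placeholders, not the actual DeepSpeech API.

```python
import torch

def saliency_map(model, features, target_index):
    """Vanilla gradient saliency: |d output / d input| for one output class.

    Assumptions (illustrative only): `model` maps a (time, freq) feature
    tensor, batched to (1, time, freq), onto per-frame character logits of
    shape (1, time, num_chars).
    """
    features = features.clone().detach().requires_grad_(True)
    logits = model(features.unsqueeze(0))       # (1, time, num_chars), assumed shape
    score = logits[0, :, target_index].sum()    # total evidence for one character class
    score.backward()                            # backpropagate to the input features
    return features.grad.abs()                  # (time, freq) attribution map
```

The resulting map can be overlaid on the input spectrogram to highlight which time-frequency regions most influenced the chosen output class; LRP and SHAP replace the raw gradient with relevance-propagation rules and Shapley-value estimates, respectively.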