End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation. However, few tools are available for understanding a model's internal functions and the effect of hierarchical dependencies within its architecture. Understanding the correlations between layer-wise representations is crucial for deriving insights into the relationship between neural representations and performance. Network similarity analysis using correlation techniques has not previously been explored for end-to-end ASR models. This paper analyses the internal dynamics between layers during training for CNN-, LSTM-, and Transformer-based approaches, using canonical correlation analysis (CCA) and centered kernel alignment (CKA) for the experiments. It was found that neural representations within CNN layers exhibit hierarchical correlation dependencies as layer depth increases, although this is mostly limited to cases where the representations correlate closely. This behaviour is not observed in the LSTM architecture; instead, a bottom-up pattern is observed across the training process, while Transformer encoder layers exhibit irregular correlation coefficients as depth increases. Altogether, these results provide new insights into the role that neural architectures play in speech recognition performance. More specifically, these techniques can serve as indicators for building better-performing speech recognition models.
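To illustrate the kind of similarity measure the paper applies, the following is a minimal sketch of linear CKA between two layers' activation matrices. The function name, data shapes, and synthetic activations here are illustrative assumptions, not the paper's actual experimental setup:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n_frames, d1) activations from one layer.
    Y: (n_frames, d2) activations from another layer.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord='fro') ** 2
    den = np.linalg.norm(X.T @ X, ord='fro') * np.linalg.norm(Y.T @ Y, ord='fro')
    return num / den

# Hypothetical layer activations: 200 speech frames, 64- and 128-dim features.
rng = np.random.default_rng(0)
layer_a = rng.standard_normal((200, 64))
layer_b = rng.standard_normal((200, 128))

same = linear_cka(layer_a, layer_a)   # identical representations score ~1.0
diff = linear_cka(layer_a, layer_b)   # unrelated random representations score much lower
```

In a layer-wise analysis, this score would be computed for every pair of layers to form a similarity matrix, from which the hierarchical or bottom-up patterns described above can be read off. CCA-based scores can be substituted in the same pipeline.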