End-to-end DNN architectures have pushed the state of the art in speech technologies, as well as in other spheres of AI, leading researchers to train more complex and deeper models. These improvements came at the cost of transparency: DNNs are innately opaque and difficult to interpret. We no longer understand what features are learned, where they are preserved, and how they inter-operate. Such analysis is important for better model understanding, debugging, and ensuring fairness in ethical decision making. In this work, we analyze the representations learned within deep speech models trained for the tasks of speaker recognition, dialect identification, and reconstruction of masked signals. We carry out a layer- and neuron-level analysis of the utterance-level representations captured within pretrained speech models for speaker, language, and channel properties. We study: is this information captured in the learned representations? Where is it preserved? How is it distributed? And can we identify a minimal subset of the network that possesses this information? Using diagnostic classifiers, we answer these questions. Our results reveal that: (i) channel and gender information is omnipresent and redundantly distributed; (ii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network and are localised in the upper layers; (iii) a minimal subset of neurons can be extracted to encode a predefined property; (iv) salient neurons are sometimes shared between properties, which can highlight the presence of biases in the network. Our cross-architectural comparison indicates that (v) the pretrained models capture speaker-invariant information and (vi) the pretrained CNN models are competitive with the Transformers for encoding information about the studied properties. To the best of our knowledge, this is the first study to perform neuron-level analysis on speech models.
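The diagnostic-classifier methodology mentioned above can be sketched as follows: a simple linear probe is trained on frozen layer representations to predict a property, and the probe's weight magnitudes rank neurons by salience. This is a minimal illustration on synthetic data, not the paper's actual pipeline; the array shapes, the binary property, and the "salient" neuron indices are all hypothetical.

```python
# Minimal sketch of a diagnostic (probing) classifier, assuming we already
# have frozen utterance-level representations from one layer of a pretrained
# speech model. All data below is synthetic for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for layer activations: 200 utterances x 16 neurons,
# with a binary property (e.g. gender) linearly encoded in two neurons.
n, d = 200, 16
X = rng.normal(size=(n, d))
y = (X[:, 3] + X[:, 7] > 0).astype(float)  # property lives in neurons 3 and 7

# Logistic-regression probe trained by plain gradient descent on the
# frozen representations (the probe, not the speech model, is trained).
w, b = np.zeros(d), 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= lr * (X.T @ (p - y)) / n            # gradient step on weights
    b -= lr * float(np.mean(p - y))          # gradient step on bias

# High probe accuracy indicates the property is linearly decodable
# from this layer's representations.
acc = float(np.mean(((X @ w + b) > 0) == (y == 1)))
print(f"probe accuracy: {acc:.2f}")

# Neuron-level analysis: probe weight magnitudes rank which neurons carry
# the property; a minimal subset can then be selected from the top ranks.
top = sorted(np.argsort(-np.abs(w))[:2].tolist())
print("most salient neurons:", top)
```

Repeating this probe per layer localises where a property is encoded, and comparing the ranked neuron lists across properties reveals when salient neurons are shared.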