Self-supervised Audio Transformers (SAT) have enabled great success in many downstream speech applications such as ASR, but how they work has not yet been widely explored. In this work, we present multiple strategies for analyzing the attention mechanisms in SAT. We categorize attention heads into explainable categories and find that each category possesses its own unique functionality. We provide a visualization tool for understanding multi-head self-attention, importance ranking strategies for identifying critical attention heads, and attention refinement techniques to improve model performance.
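To make the kind of head-level analysis described above concrete, the following is a minimal sketch, not the paper's actual method. It assumes attention maps of shape (batch, heads, T, T) have already been extracted from a SAT layer, and ranks heads by two simple, hypothetical diagnostics: the entropy of their attention distributions (how focused a head is) and the fraction of attention mass near the diagonal (how local, or "diagonal", a head is). The function names `attention_entropy` and `diagonality` are illustrative, not from the paper.

```python
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Average per-head entropy of the attention distributions.

    attn: (batch, heads, T, T) row-stochastic attention maps.
    Returns a (heads,) tensor; lower entropy = more focused head.
    """
    eps = 1e-9
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (batch, heads, T)
    return ent.mean(dim=(0, 2))                     # (heads,)

def diagonality(attn: torch.Tensor, width: int = 2) -> torch.Tensor:
    """Fraction of attention mass within +/- `width` frames of the diagonal.

    A high value suggests a local ("diagonal") head; a low value
    suggests a global or vertical attention pattern.
    """
    _, _, t, _ = attn.shape
    idx = torch.arange(t)
    band = ((idx[:, None] - idx[None, :]).abs() <= width).float()  # (T, T)
    # Each row sums to 1, so total mass per map is T; normalize by T.
    in_band = (attn * band).sum(dim=(-1, -2)) / t                  # (batch, heads)
    return in_band.mean(dim=0)                                     # (heads,)

if __name__ == "__main__":
    # Toy example: random attention maps for 12 heads over 50 frames.
    batch, heads, frames = 2, 12, 50
    attn = torch.softmax(torch.randn(batch, heads, frames, frames), dim=-1)
    ranking = torch.argsort(attention_entropy(attn))  # most focused heads first
    print("heads ranked by focus:", ranking.tolist())
    print("diagonality per head:", diagonality(attn).tolist())
```

Scores like these could feed both a visualization (plotting the maps of the most focused heads) and a pruning or refinement step (dropping heads ranked least important), in the spirit of the strategies outlined above.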