Transformer-based models have become state-of-the-art tools for a range of machine learning tasks, including time series classification, yet their complexity makes their internal decision-making difficult to understand. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting several Mechanistic Interpretability techniques (activation patching, attention saliency, and sparse autoencoders) from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experiments on a benchmark time series dataset, we construct causal graphs that illustrate how information propagates internally, highlighting the attention heads and temporal positions that drive correct classifications. We also demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings offer both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance on time series classification tasks.
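For concreteness, the sketch below illustrates timestep-level activation patching, one of the techniques named above: cache an activation from a clean forward pass, splice it into a corrupted forward pass at one temporal position, and compare the resulting logits. The model object, the chosen layer handle, and the (batch, time, d_model) activation layout are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal activation-patching sketch using PyTorch forward hooks.
# Assumes `layer` is an nn.Module inside `model` whose forward output is a
# tensor of shape (batch, time, d_model); these names are hypothetical.
import torch

def activation_patch(model, layer, timestep, clean_x, corrupt_x):
    """Cache `layer`'s output on the clean input, re-run the corrupted input
    with that activation spliced in at one timestep, and return the patched
    logits; the shift in logits measures that position's causal effect."""
    cache = {}

    def save_hook(module, inputs, output):
        # Store the clean activation for later splicing.
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        # Overwrite the corrupted activation at `timestep` with the clean one.
        patched = output.clone()
        patched[:, timestep, :] = cache["clean"][:, timestep, :]
        return patched  # returning a value replaces the module's output

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_x)                     # clean run, activation cached
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_x)  # corrupted run with the patch
    handle.remove()
    return patched_logits
```

Repeating this over layers, heads, or timesteps and recording the change in the class logits is what yields the causal graphs described above.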