Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training and interpreting SAEs at scale remain challenging, especially when large dictionary sizes are used. While decoders can leverage sparsity-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
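To make the factorization concrete, the following is a minimal sketch (not the authors' reference implementation) of a Kronecker-factorized SAE encoder. The class name `KronSAEEncoder`, the choice of `m1`/`m2`, and the exact form of the `m_and` gate are illustrative assumptions based only on the description above; it shows how a dictionary of size m = m1 * m2 can be produced from two small encoder heads without a dense d_model -> m projection.

```python
# Illustrative sketch only; shapes, names, and the exact mAND form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def m_and(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """AND-like differentiable gate: nonzero only when both inputs are
    positive. Assumption: product of ReLU-rectified parts; the paper's
    exact mAND may differ."""
    return F.relu(a) * F.relu(b)


class KronSAEEncoder(nn.Module):
    """Encoder whose m = m1 * m2 latents come from two small heads
    combined as an outer (Kronecker-style) product, avoiding a dense
    d_model -> m weight matrix."""

    def __init__(self, d_model: int, m1: int, m2: int):
        super().__init__()
        self.enc1 = nn.Linear(d_model, m1)  # cost O(d_model * m1)
        self.enc2 = nn.Linear(d_model, m2)  # cost O(d_model * m2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.enc1(x)  # (batch, m1)
        b = self.enc2(x)  # (batch, m2)
        # Outer product of the two factor activations yields the full
        # (batch, m1 * m2) latent code.
        z = m_and(a.unsqueeze(-1), b.unsqueeze(-2))  # (batch, m1, m2)
        return z.flatten(start_dim=-2)               # (batch, m1 * m2)


if __name__ == "__main__":
    enc = KronSAEEncoder(d_model=768, m1=128, m2=128)  # 16,384 latents
    z = enc(torch.randn(4, 768))
    print(z.shape)  # torch.Size([4, 16384])
```

Under these assumptions the encoder's parameter count scales as d_model * (m1 + m2) rather than d_model * m1 * m2, which is the source of the memory and compute savings claimed above.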