We present Exact Causal Attention (ECA), a Strassen-style algorithm that computes exact Causal Attention using 10\% fewer operations. ECA accelerates a special class of matrix multiplications in which either one operand or the output matrix is upper- or lower-triangular. This class covers all matrix multiplications in the forward and backward pass of Causal Attention, such as the masked product $\mathrm{Mask}(QK^{T})$. ECA is built upon algebraic identities discovered via machine learning and combinatorial search. We note that ECA cannot accelerate fused GPU kernels such as FlashAttention, because it requires materializing large intermediate matrices in memory, whereas FlashAttention does not. However, it provides an alternative approach for compute-bound applications and can be useful in settings where FLOP count is the dominant cost.
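For reference, the sketch below is a minimal NumPy implementation of the baseline (unoptimized) causal-attention forward pass, included only to make explicit where the triangular structure arises; it is not the ECA algorithm, and the function name and shapes are illustrative assumptions rather than details from the paper.

```python
# Baseline causal-attention forward pass (NOT the ECA algorithm).
# Shown only to highlight the two matrix products whose triangular
# structure ECA exploits; the Strassen-style identities are not reproduced.
import numpy as np

def causal_attention_forward(Q, K, V):
    """Reference causal attention: O = softmax(Mask(Q K^T / sqrt(d))) V."""
    n, d = Q.shape
    # Masked product Mask(Q K^T): entries above the diagonal are discarded
    # by the causal mask, so the output is effectively lower-triangular.
    S = Q @ K.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    S = np.where(mask, S, -np.inf)
    # Row-wise softmax over the unmasked (lower-triangular) entries.
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)
    # Second matmul P V: here one operand (P) is lower-triangular,
    # the other structured case covered by ECA.
    return P @ V

# Example usage with small random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
O = causal_attention_forward(Q, K, V)
print(O.shape)  # (8, 4)
```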