The dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., the Transformer) for sequence modeling; however, na\"ive exact computation of this model incurs quadratic time and memory complexity in the sequence length, hindering the training of long-sequence models. The critical bottlenecks are the computation of the partition functions in the denominator of the softmax function and the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and that an efficient KDE solver can then be used to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer approximates the attention in sub-quadratic time with provable spectral norm bounds, whereas all prior results provide only entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.
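For concreteness, the reduction to KDE can be illustrated with standard attention notation (the symbols $Q, K, V \in \mathbb{R}^{n \times d}$ and the kernel $\kappa$ are not defined in the abstract and are used here only as an illustrative sketch): unnormalized attention can be written as
\[
\mathrm{Attn}(Q,K,V) \;=\; D^{-1} A V, \qquad A \;=\; \exp\!\big(QK^\top\big), \qquad D \;=\; \mathrm{diag}\big(A\,\mathbf{1}_n\big),
\]
so that each diagonal entry of $D$ is a partition function
\[
D_{ii} \;=\; \sum_{j=1}^{n} \exp\!\big(q_i^\top k_j\big) \;=\; n \cdot \frac{1}{n}\sum_{j=1}^{n} \kappa(q_i, k_j), \qquad \kappa(x,y) \;=\; \exp\!\big(x^\top y\big),
\]
i.e., (up to scaling) a kernel density evaluated at the query $q_i$ over the keys $\{k_j\}_{j=1}^{n}$, which is exactly the quantity an efficient KDE solver estimates without forming the full $n \times n$ matrix $A$.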