Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences. However, it is still challenging to balance the trade-off between model quality and efficiency with a one-size-fits-all approximation across different tasks. To better understand this trade-off, we observe that sparse and low-rank approximations excel in different regimes, determined by the softmax temperature in attention, and that combining sparse and low-rank approximation can outperform each individually. Inspired by the classical robust-PCA algorithm for sparse and low-rank decomposition, we propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate and efficient approximation. The resulting estimator is unbiased with provably low error. We empirically show that Scatterbrain achieves 2.1x lower approximation error than baselines when used as a drop-in replacement in BigGAN image generation and a pre-trained T2T-ViT. On a pre-trained T2T Vision Transformer, even without fine-tuning, Scatterbrain reduces attention memory by 98% at the cost of only a 1% drop in accuracy. For end-to-end training, Scatterbrain achieves up to 4 points better perplexity and 5 points better average accuracy than sparse or low-rank efficient Transformers on language modeling and Long-Range Arena tasks.
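To make the sparse-plus-low-rank decomposition concrete, here is a minimal, unbatched single-head NumPy sketch, assuming a Performer-style positive random-feature map for the low-rank part and a sign-based (SimHash) LSH for selecting the sparse entries; the function names, shapes, and hyperparameters are illustrative and are not the authors' implementation.

```python
import numpy as np

def softmax_feature_map(x, projections):
    # Performer-style positive random features: E[phi(q) @ phi(k)] = exp(q @ k),
    # so phi(Q) @ phi(K)^T is a low-rank, unbiased estimate of exp(Q K^T).
    m = projections.shape[0]
    return np.exp(x @ projections.T - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

def lsh_buckets(x, hyperplanes):
    # Sign-based (SimHash) LSH: vectors with high inner product tend to land
    # in the same bucket, so colliding (query, key) pairs are likely to carry
    # large attention weights.
    bits = (x @ hyperplanes.T > 0).astype(int)
    return bits @ (1 << np.arange(hyperplanes.shape[0]))

def scatterbrain_attention(q, k, v, n_features=64, n_bits=4, seed=0):
    # Sketch only: q, k are (n, d); v is (n, d_v).
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    proj = rng.standard_normal((n_features, d))
    planes = rng.standard_normal((n_bits, d))

    # Low-rank component: never materializes the full n x n attention matrix.
    q_f, k_f = softmax_feature_map(q, proj), softmax_feature_map(k, proj)
    num = q_f @ (k_f.T @ v)            # approximates exp(Q K^T) V
    den = q_f @ k_f.sum(axis=0)        # approximates the softmax row sums

    # Sparse component: on LSH-collided (i, j) pairs, swap the low-rank
    # estimate of exp(q_i . k_j) for its exact value. Subtracting the
    # low-rank estimate (rather than adding the exact value on top)
    # is what keeps the combined estimator unbiased.
    qb, kb = lsh_buckets(q, planes), lsh_buckets(k, planes)
    for i in range(q.shape[0]):
        js = np.nonzero(kb == qb[i])[0]
        if js.size == 0:
            continue
        exact = np.exp(q[i] @ k[js].T)     # true unnormalized scores
        approx = q_f[i] @ k_f[js].T        # their low-rank estimates
        num[i] += (exact - approx) @ v[js]
        den[i] += (exact - approx).sum()

    return num / den[:, None]

# Usage: the output has the same shape as v.
rng = np.random.default_rng(1)
q, k, v = rng.standard_normal((3, 128, 32))
out = scatterbrain_attention(q, k, v)      # out.shape == (128, 32)
```

The sparse correction only touches entries where the low-rank estimate is likely to be worst (near-collisions, i.e. large attention scores), which is the intuition behind combining the two approximations rather than using either one alone.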