Transformer architectures have led to remarkable progress in many state-of-the-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time and space complexity is quadratic in the length of the input. Several approaches have been proposed to speed up the self-attention mechanism and achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear time, at the cost of an exponential dependence on the polynomial order.
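The final claim can be made concrete with a short sketch: truncating exp(q·k) to its order-p Taylor series lets the attention scores factor through explicit polynomial feature maps, so the key/value aggregates can be computed once and reused for every query, giving time linear in the sequence length n but a feature dimension (and hence cost) that grows as d^p. The NumPy code below is an illustrative sketch of this idea under those assumptions, not the paper's exact construction; the function names, the choice of p, and the normalization are ours for demonstration.

```python
import math
import numpy as np

def softmax_attention(Q, K, V):
    """Exact dot-product self-attention: O(n^2 d) time, O(n^2) memory."""
    S = Q @ K.T                                   # (n, n) score matrix
    P = np.exp(S - S.max(axis=1, keepdims=True))  # row-wise softmax
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

def taylor_features(X, p):
    """Map each row x to the flattened tensors x^{(j)} / sqrt(j!) for j = 0..p,
    so that phi(q) . phi(k) = sum_j (q.k)^j / j!, the order-p Taylor series of
    exp(q.k). The feature dimension is sum_j d^j, i.e. exponential in p."""
    n, d = X.shape
    feats, cur = [np.ones((n, 1))], np.ones((n, 1))
    for j in range(1, p + 1):
        cur = np.einsum('na,nb->nab', cur, X).reshape(n, -1)  # j-fold outer product
        feats.append(cur / math.sqrt(math.factorial(j)))
    return np.concatenate(feats, axis=1)

def taylor_attention(Q, K, V, p=3):
    """Approximate attention via the order-p Taylor series of exp: key/value
    aggregates are formed once, so the cost is O(n d^p), linear in n."""
    phi_Q, phi_K = taylor_features(Q, p), taylor_features(K, p)
    KV = phi_K.T @ V              # (D, d_v) aggregate, D = sum_{j<=p} d^j
    Z = phi_K.sum(axis=0)         # (D,) normalizer aggregate
    return (phi_Q @ KV) / (phi_Q @ Z)[:, None]

# Usage: for small score magnitudes the truncation tracks exact attention closely.
rng = np.random.default_rng(0)
n, d = 512, 8
Q, K, V = (0.3 * rng.standard_normal((n, d)) for _ in range(3))
print(np.abs(softmax_attention(Q, K, V) - taylor_attention(Q, K, V, p=3)).max())
```

With p fixed, the feature dimension is Θ(d^p): the method is linear in n, but the exponential dependence on the polynomial order noted in the abstract shows up directly in the width of the feature maps.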