Large language models (LLMs) have brought fundamental changes to human life. The attention scheme is one of the key components of all LLMs, such as BERT, GPT-1, Transformers, GPT-2, GPT-3, GPT-3.5, and GPT-4. Inspired by previous theoretical studies of the static version of the attention multiplication problem [Zandieh, Han, Daliri, and Karbasi arXiv 2023; Alman and Song arXiv 2023], in this work we formally define a dynamic version of the attention matrix multiplication problem. There are three matrices $Q, K, V \in \mathbb{R}^{n \times d}$, representing the query, key, and value in LLMs. In each iteration, an update changes one entry of $K$ or $V$. In the query stage, we receive $(i,j) \in [n] \times [d]$ as input and want to answer $(D^{-1} A V)_{i,j}$, where $A := \exp(QK^\top) \in \mathbb{R}^{n \times n}$ is a square matrix and $D := \mathrm{diag}(A {\bf 1}_n) \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Here ${\bf 1}_n$ denotes the length-$n$ all-ones vector. We provide two results: an algorithm and a conditional lower bound.

$\bullet$ On one hand, inspired by the lazy update idea from [Demetrescu and Italiano FOCS 2000; Sankowski FOCS 2004; Cohen, Lee, and Song STOC 2019; Brand SODA 2020], we provide a data structure that uses $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time, where $\omega(1,1,\tau)$ denotes the exponent of multiplying an $n \times n$ matrix by an $n \times n^\tau$ matrix.

$\bullet$ On the other hand, we show that unless the hinted matrix-vector multiplication conjecture [Brand, Nanongkai, and Saranurak FOCS 2019] is false, no algorithm can simultaneously achieve $O(n^{\omega(1,1,\tau) - \tau - \Omega(1)})$ amortized update time and $O(n^{1+\tau-\Omega(1)})$ worst-case query time.

In conclusion, our algorithmic result is conditionally optimal: it cannot be improved unless the hinted matrix-vector multiplication conjecture is false.
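To make the query target concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper; the function name `attention_entry` and the toy dimensions are ours) that evaluates $(D^{-1} A V)_{i,j}$ by brute force, with no dynamic maintenance:

```python
import numpy as np

def attention_entry(Q, K, V, i, j):
    """Evaluate (D^{-1} A V)_{i,j} by brute force, where
    A = exp(Q K^T) and D = diag(A 1_n).
    Costs O(n d) time per query, with no preprocessing."""
    a_i = np.exp(Q[i] @ K.T)            # i-th row of A, length n
    return (a_i @ V[:, j]) / a_i.sum()  # softmax-weighted entry of A V

# Toy usage: n = 8 tokens, d = 4 dimensions.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(attention_entry(Q, K, V, i=2, j=1))
```

Recomputing this from scratch after every entry update to $K$ or $V$ is the naive baseline the dynamic data structure is meant to beat.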
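The lazy-update idea can be sketched as follows. This is a toy stand-in for the paper's data structure, under simplifying assumptions: the class name and the `threshold` parameter are ours, and the dense NumPy matmul in the rebuild step replaces the fast rectangular matrix multiplication (time $n^{\omega(1,1,\tau)}$) used in the actual algorithm.

```python
import numpy as np

class DynamicAttention:
    """Lazy-update sketch: buffer entry updates to K or V, correct
    queries against the buffer, and rebuild the caches once the
    buffer fills. Illustrative only."""

    def __init__(self, Q, K, V, threshold=32):
        self.Q = np.asarray(Q, dtype=float)
        self.K = np.array(K, dtype=float)
        self.V = np.array(V, dtype=float)
        self.threshold = threshold          # plays the role of n^tau
        self._rebuild()

    def _rebuild(self):
        # Recompute all cached products from scratch; this cost is
        # amortized over the `threshold` buffered updates.
        self.A = np.exp(self.Q @ self.K.T)  # cached n x n matrix A
        self.M = self.A @ self.V            # cached product A V
        self.d = self.A.sum(axis=1)         # cached row sums A 1_n
        self.V_cache = self.V.copy()        # V as of the last rebuild
        self.stale = set()                  # rows of K/V touched since rebuild

    def update(self, target, a, b, value):
        """Set K[a,b] or V[a,b] to `value`; rebuild once the buffer fills."""
        (self.K if target == 'K' else self.V)[a, b] = value
        self.stale.add(a)
        if len(self.stale) > self.threshold:
            self._rebuild()

    def query(self, i, j):
        """Return (D^{-1} A V)_{i,j}, correcting the stale cache
        entries on the fly in O(|stale| * d) extra time."""
        m, dsum = self.M[i, j], self.d[i]
        for a in self.stale:
            a_old = self.A[i, a]                   # entry of the cached A
            a_new = np.exp(self.Q[i] @ self.K[a])  # recomputed with current K
            m += a_new * self.V[a, j] - a_old * self.V_cache[a, j]
            dsum += a_new - a_old
        return m / dsum
```

With `threshold` set to roughly $n^\tau$, each rebuild's cost is spread over $n^\tau$ updates, mirroring the $O(n^{\omega(1,1,\tau)-\tau})$ amortized update bound, while a query touches at most $n^\tau$ stale rows, mirroring the $O(n^{1+\tau})$ worst-case query bound.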