Large language models (LLMs) have brought fundamental changes to human life. The attention scheme is one of the key components of all LLMs, such as BERT, GPT-1, Transformers, GPT-2, GPT-3, GPT-3.5, and GPT-4. Inspired by previous theoretical studies of the static version of the attention multiplication problem [Zandieh, Han, Daliri, and Karbasi arXiv 2023; Alman and Song arXiv 2023], in this work we formally define a dynamic version of the attention matrix multiplication problem. There are three matrices $Q, K, V \in \mathbb{R}^{n \times d}$, representing the query, key, and value in LLMs. In each iteration, an update changes one entry of $K$ or $V$. In the query stage, we receive $(i,j) \in [n] \times [d]$ as input and want to answer $(D^{-1} A V)_{i,j}$, where $A := \exp(QK^\top) \in \mathbb{R}^{n \times n}$ is a square matrix and $D := \mathrm{diag}(A {\bf 1}_n) \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Here ${\bf 1}_n$ denotes the length-$n$ all-ones vector. We provide two results: an algorithm and a conditional lower bound.

$\bullet$ On one hand, inspired by the lazy update idea from [Demetrescu and Italiano FOCS 2000; Sankowski FOCS 2004; Cohen, Lee, and Song STOC 2019; Brand SODA 2020], we provide a data structure that uses $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time, where $\omega(1,1,\tau)$ denotes the exponent of multiplying an $n \times n$ matrix by an $n \times n^\tau$ matrix.

$\bullet$ On the other hand, we show that unless the hinted matrix-vector multiplication conjecture [Brand, Nanongkai, and Saranurak FOCS 2019] is false, no algorithm can simultaneously achieve $O(n^{\omega(1,1,\tau) - \tau - \Omega(1)})$ amortized update time and $O(n^{1+\tau-\Omega(1)})$ worst-case query time.

In conclusion, our algorithmic result is conditionally optimal: it cannot be improved unless the hinted matrix-vector multiplication conjecture is false.
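To make the query target concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper; the function name `attention_entry` and the toy dimensions are ours) that evaluates $(D^{-1} A V)_{i,j}$ by brute force, with no dynamic maintenance:

```python
import numpy as np

def attention_entry(Q, K, V, i, j):
    """Evaluate (D^{-1} A V)_{i,j} by brute force, where
    A = exp(Q K^T) and D = diag(A 1_n).
    Costs O(n d) time per query, with no preprocessing."""
    a_i = np.exp(Q[i] @ K.T)            # i-th row of A, length n
    return (a_i @ V[:, j]) / a_i.sum()  # softmax-weighted entry of A V

# Toy usage: n = 8 tokens, d = 4 dimensions.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(attention_entry(Q, K, V, i=2, j=1))
```

Recomputing this from scratch after every entry update to $K$ or $V$ is the naive baseline the dynamic data structure is meant to beat.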
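The lazy-update idea can be sketched as follows. This is a toy stand-in for the paper's data structure, under simplifying assumptions: the class name and the `threshold` parameter are ours, and the dense NumPy matmul in the rebuild step replaces the fast rectangular matrix multiplication (time $n^{\omega(1,1,\tau)}$) used in the actual algorithm.

```python
import numpy as np

class DynamicAttention:
    """Lazy-update sketch: buffer entry updates to K or V, correct
    queries against the buffer, and rebuild the caches once the
    buffer fills. Illustrative only."""

    def __init__(self, Q, K, V, threshold=32):
        self.Q = np.asarray(Q, dtype=float)
        self.K = np.array(K, dtype=float)
        self.V = np.array(V, dtype=float)
        self.threshold = threshold          # plays the role of n^tau
        self._rebuild()

    def _rebuild(self):
        # Recompute all cached products from scratch; this cost is
        # amortized over the `threshold` buffered updates.
        self.A = np.exp(self.Q @ self.K.T)  # cached n x n matrix A
        self.M = self.A @ self.V            # cached product A V
        self.d = self.A.sum(axis=1)         # cached row sums A 1_n
        self.V_cache = self.V.copy()        # V as of the last rebuild
        self.stale = set()                  # rows of K/V touched since rebuild

    def update(self, target, a, b, value):
        """Set K[a,b] or V[a,b] to `value`; rebuild once the buffer fills."""
        (self.K if target == 'K' else self.V)[a, b] = value
        self.stale.add(a)
        if len(self.stale) > self.threshold:
            self._rebuild()

    def query(self, i, j):
        """Return (D^{-1} A V)_{i,j}, correcting the stale cache
        entries on the fly in O(|stale| * d) extra time."""
        m, dsum = self.M[i, j], self.d[i]
        for a in self.stale:
            a_old = self.A[i, a]                   # entry of the cached A
            a_new = np.exp(self.Q[i] @ self.K[a])  # recomputed with current K
            m += a_new * self.V[a, j] - a_old * self.V_cache[a, j]
            dsum += a_new - a_old
        return m / dsum
```

With `threshold` set to roughly $n^\tau$, each rebuild's cost is spread over $n^\tau$ updates, mirroring the $O(n^{\omega(1,1,\tau)-\tau})$ amortized update bound, while a query touches at most $n^\tau$ stale rows, mirroring the $O(n^{1+\tau})$ worst-case query bound.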