Large language models have become ubiquitous in modern life, with applications in domains such as natural language processing, language translation, and speech recognition. A recent breakthrough [Zhao, Panigrahi, Ge, and Arora, arXiv 2023] explains the attention model via probabilistic context-free grammars (PCFGs). One of the central computational tasks in computing probabilities in a PCFG can be formulated as a particular tensor low-rank approximation problem, which we call tensor cycle rank. Given an $n \times n \times n$ third-order tensor $A$, we say that $A$ has cycle rank $k$ if there exist three $n \times k^2$ matrices $U$, $V$, and $W$ such that \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,\, i+k(j-1)} \cdot V_{b,\, j + k(l-1)} \cdot W_{c,\, l + k(i-1) } \end{align*} for all $a, b, c \in [n]$; the three index pairs $(i,j)$, $(j,l)$, $(l,i)$ chase each other in a cycle, hence the name. Low-rank approximation under the classical tensor rank, Tucker rank, and tensor train rank is well studied [Song, Woodruff, Zhong SODA 2019]. In this paper, we generalize the ``rotation and sketch'' technique (page 186 of [Song, Woodruff, Zhong SODA 2019]) and give an input-sparsity-time algorithm for cycle rank.
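To make the index bookkeeping in the definition concrete, the following is a minimal NumPy sketch (not the paper's algorithm): it builds a cycle rank-$k$ tensor from factor matrices $U$, $V$, $W$ and spot-checks one entry against the definition. The reshape convention and the random factors are assumptions chosen purely for illustration.
\begin{verbatim}
# Minimal sketch: construct a cycle rank-k tensor from n x k^2
# factors U, V, W per the definition
#   A[a,b,c] = sum_{i,j,l} U[a,i+k(j-1)] V[b,j+k(l-1)] W[c,l+k(i-1)]
# (written here 0-indexed). Illustration only, not the paper's algorithm.
import numpy as np

n, k = 6, 2
rng = np.random.default_rng(0)
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))

# With 0-indexed column i + k*j, a C-order reshape to (n, k, k)
# puts j on the middle axis and i on the last axis.
U3 = U.reshape(n, k, k)  # U3[a, j, i] = U[a, i + k*j]
V3 = V.reshape(n, k, k)  # V3[b, l, j] = V[b, j + k*l]
W3 = W.reshape(n, k, k)  # W3[c, i, l] = W[c, l + k*i]

# The cyclic pattern (i,j), (j,l), (l,i) becomes one contraction:
A = np.einsum('aji,blj,cil->abc', U3, V3, W3)

# Spot-check one entry directly against the definition (0-indexed).
a, b, c = 1, 2, 3
direct = sum(U[a, i + k * j] * V[b, j + k * l] * W[c, l + k * i]
             for i in range(k) for j in range(k) for l in range(k))
assert np.isclose(A[a, b, c], direct)
\end{verbatim}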