Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to their architectural components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. In practice, this matrix typically exhibits two properties: (1) it is sparse, meaning that each token only attends to a small subset of other tokens; and (2) it changes dynamically depending on the input to the module. With these considerations in mind, we ask the following question: Can a fixed self-attention module approximate arbitrary sparse patterns depending on the input? How small is the hidden size $d$ required for such approximation? We make progress in answering this question and show that the self-attention matrix can provably approximate sparse matrices, where sparsity is defined in terms of a bounded number of nonzero elements in each row and column. While the parameters of self-attention remain fixed, various sparse matrices can be approximated by modifying only the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. Our proof is constructive, enabling us to propose an algorithm for finding adaptive inputs and fixed self-attention parameters in order to approximate a given matrix. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, $d$ grows only logarithmically with the sequence length $L$ (i.e. $d = O(\log L)$).
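The following is a minimal NumPy sketch of the Johnson-Lindenstrauss intuition behind this result, not the paper's actual construction or algorithm: when $d$ grows only logarithmically in $L$, random unit key vectors are nearly pairwise orthogonal, so queries built as scaled sums of the keys in each row's target support steer a row-wise softmax toward an arbitrary sparse pattern. The constants below (the factor 16 in $d$, the softmax scale, the per-row sparsity $k$) are illustrative choices rather than values from the paper, and in a full self-attention module the query and key parts would be packed into a single input vector and extracted by fixed projection matrices, a bookkeeping step the sketch omits.

```python
# Minimal sketch (assumed construction, not the paper's algorithm): with
# d = O(log L), random unit keys are nearly orthogonal by the
# Johnson-Lindenstrauss lemma, so a softmax attention matrix can be steered
# toward an arbitrary sparse pattern purely through its inputs.
import numpy as np

rng = np.random.default_rng(0)

L = 256                       # sequence length
d = int(16 * np.log(L))       # hidden size, logarithmic in L (constant is illustrative)
scale = 20.0                  # softmax sharpness (hypothetical choice)
k = 4                         # nonzeros per row of the target pattern

# Target sparse pattern: each row attends to at most k random positions.
pattern = np.zeros((L, L), dtype=bool)
for i in range(L):
    pattern[i, rng.choice(L, size=k, replace=False)] = True

# Random unit "key" vectors; by JL they are nearly pairwise orthogonal.
keys = rng.standard_normal((L, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# Each query is the scaled sum of the keys in its row's support, so that
# q_i . k_j is close to `scale` iff j is in the support and close to 0 otherwise.
queries = scale * (pattern.astype(float) @ keys)

# Row-wise softmax of the attention logits.
attn = np.exp(queries @ keys.T)
attn /= attn.sum(axis=1, keepdims=True)

# Most of the attention mass should concentrate on the target sparse support.
mass_on_pattern = (attn * pattern).sum(axis=1).mean()
print(f"d = {d}, mean attention mass on target pattern: {mass_on_pattern:.3f}")
```

Note that the self-attention "parameters" here are trivially fixed (identity projections); only the inputs (the query/key vectors) change with the desired sparse pattern, which mirrors the adaptive-input viewpoint of the abstract under the stated assumptions.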