Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to their architectural components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. In practice, this matrix typically exhibits two properties: (1) it is sparse, meaning that each token only attends to a small subset of other tokens; and (2) it changes dynamically depending on the input to the module. With these considerations in mind, we ask the following question: Can a fixed self-attention module approximate arbitrary sparse patterns depending on the input? How small is the hidden size $d$ required for such approximation? We make progress in answering this question and show that the self-attention matrix can provably approximate sparse matrices, where sparsity is defined in terms of a bounded number of nonzero elements in each row and column. While the parameters of self-attention remain fixed, various sparse matrices can be approximated by modifying only the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. Our proof is constructive, enabling us to propose an algorithm for finding adaptive inputs and fixed self-attention parameters in order to approximate a given matrix. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, $d$ grows only logarithmically with the sequence length $L$ (i.e. $d = O(\log L)$).
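The following is a minimal NumPy sketch of the Johnson-Lindenstrauss intuition behind this result, not the paper's actual construction or algorithm: when $d$ grows only logarithmically in $L$, random unit key vectors are nearly pairwise orthogonal, so queries built as scaled sums of the keys in each row's target support steer a row-wise softmax toward an arbitrary sparse pattern. The constants below (the factor 16 in $d$, the softmax scale, the per-row sparsity $k$) are illustrative choices rather than values from the paper, and in a full self-attention module the query and key parts would be packed into a single input vector and extracted by fixed projection matrices, a bookkeeping step the sketch omits.

```python
# Minimal sketch (assumed construction, not the paper's algorithm): with
# d = O(log L), random unit keys are nearly orthogonal by the
# Johnson-Lindenstrauss lemma, so a softmax attention matrix can be steered
# toward an arbitrary sparse pattern purely through its inputs.
import numpy as np

rng = np.random.default_rng(0)

L = 256                       # sequence length
d = int(16 * np.log(L))       # hidden size, logarithmic in L (constant is illustrative)
scale = 20.0                  # softmax sharpness (hypothetical choice)
k = 4                         # nonzeros per row of the target pattern

# Target sparse pattern: each row attends to at most k random positions.
pattern = np.zeros((L, L), dtype=bool)
for i in range(L):
    pattern[i, rng.choice(L, size=k, replace=False)] = True

# Random unit "key" vectors; by JL they are nearly pairwise orthogonal.
keys = rng.standard_normal((L, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# Each query is the scaled sum of the keys in its row's support, so that
# q_i . k_j is close to `scale` iff j is in the support and close to 0 otherwise.
queries = scale * (pattern.astype(float) @ keys)

# Row-wise softmax of the attention logits.
attn = np.exp(queries @ keys.T)
attn /= attn.sum(axis=1, keepdims=True)

# Most of the attention mass should concentrate on the target sparse support.
mass_on_pattern = (attn * pattern).sum(axis=1).mean()
print(f"d = {d}, mean attention mass on target pattern: {mass_on_pattern:.3f}")
```

Note that the self-attention "parameters" here are trivially fixed (identity projections); only the inputs (the query/key vectors) change with the desired sparse pattern, which mirrors the adaptive-input viewpoint of the abstract under the stated assumptions.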