With the attention mechanism, transformers achieve significant empirical success. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory of how the attention mechanism achieves this. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable because they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.

- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of the input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given the input tokens, which plays a central role in predicting output labels and solving downstream tasks (see the factorization sketched after this list).
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error that decreases in the input size. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences (see the kernel-regression sketch below).
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error that is independent of input sizes. In particular, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks (a sketch of such an objective follows the other examples below).
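As a pointer for (a), the display below is a minimal sketch of the latent variable model that exchangeability induces, under the assumptions that the tokens x_1, ..., x_L are exchangeable and that the label y depends on them only through a latent variable z; the notation is illustrative.

```latex
% A de Finetti-style factorization: exchangeable tokens are
% conditionally i.i.d. given a latent variable z.
\[
  P(x_1, \dots, x_L) = \int \prod_{i=1}^{L} P(x_i \mid z)\, P(z)\, \mathrm{d}z .
\]
% If the label y depends on the tokens only through z, prediction
% factors through the posterior, which is therefore sufficient:
\[
  P(y \mid x_1, \dots, x_L)
    = \mathbb{E}_{z \sim P(z \mid x_1, \dots, x_L)}\bigl[\, P(y \mid z) \,\bigr].
\]
```

The posterior P(z | x_1, ..., x_L) carries all the information the tokens provide about y, which is the sense in which it is sufficient; minimality holds in the usual sense that it is a function of any other sufficient statistic.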
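The inference claim in (b) can be illustrated numerically: softmax attention for a single query is exactly a Nadaraya-Watson (kernel regression) estimate of the conditional mean of the value given the key. The snippet below is a self-contained sketch on toy Gaussian data; the function name and the experimental setup are illustrative, not taken from the analysis itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_attention(query, keys, values, temperature=1.0):
    """Softmax attention for a single query vector.

    The output is a Nadaraya-Watson kernel regression estimate of the
    conditional mean of the value given the key, with an exponential
    kernel centered at the query.
    """
    scores = keys @ query / temperature       # (L,) key-query similarities
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                   # convex combination of values

# Illustrative check: values are a noisy linear function of the keys, and
# for standard Gaussian keys the exponentially tilted mean of the keys is
# the query itself, so the estimand is query @ W. The estimate should tend
# to move closer to it as the input size L grows, mirroring an
# approximation error that decreases in the input size.
W = rng.normal(size=(8, 4))
query = rng.normal(size=8)
for L in (32, 512, 8192):
    keys = rng.normal(size=(L, 8))
    values = keys @ W + 0.1 * rng.normal(size=(L, 4))
    estimate = softmax_attention(query, keys, values)
    print(L, np.linalg.norm(estimate - query @ W))
```

The dot-product score inside a softmax is precisely what makes attention a kernel smoother here; a different similarity score would yield a different smoother.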
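For the learning claim in (c), a self-supervised objective can be sketched as masked-token reconstruction, the empirical risk that BERT-style pretraining minimizes. Everything in this snippet, including the function masked_token_loss, the zero-masking scheme, and the squared loss, is an illustrative stand-in rather than the exact objective analyzed above.

```python
import numpy as np

def masked_token_loss(tokens, model, mask_prob=0.15, rng=None):
    """One-sample empirical risk for a masked-token (self-supervised) objective.

    `model` stands in for an attention-based network mapping a corrupted
    token sequence of shape (L, d) to per-position reconstructions of the
    same shape. Averaging this loss over a pretraining corpus gives the
    empirical risk whose minimizer is studied in (c).
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(tokens.shape[0]) < mask_prob    # positions to hide
    if not mask.any():
        mask[rng.integers(tokens.shape[0])] = True    # guarantee one masked position
    corrupted = np.where(mask[:, None], 0.0, tokens)  # zero out masked tokens
    preds = model(corrupted)
    # squared reconstruction error, scored only on the masked positions
    return float(np.mean((preds[mask] - tokens[mask]) ** 2))

# Usage with a trivial placeholder model (the identity), for shape-checking only:
tokens = np.random.default_rng(1).normal(size=(16, 8))
print(masked_token_loss(tokens, model=lambda x: x))
```

Solving this reconstruction task forces the model to infer the latent posterior from the unmasked tokens, which is what links the pretraining objective to the downstream condition number identified in (c).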