While the successes of transformers across many domains are indisputable, an accurate understanding of their learning mechanics is still largely lacking. Their capabilities have been probed on benchmarks that include a variety of structured and reasoning tasks -- but mathematical understanding lags substantially behind. Recent lines of work have begun to study representational aspects of this question: that is, the size/depth/complexity of attention-based networks needed to perform certain tasks. However, there is no guarantee that the learning dynamics will converge to the constructions proposed. In this paper, we provide a fine-grained mechanistic understanding of how transformers learn "semantic structure", understood as capturing the co-occurrence structure of words. Precisely, we show, through a combination of experiments on synthetic data modeled by Latent Dirichlet Allocation (LDA) and on Wikipedia data, together with mathematical analysis, that the embedding layer and the self-attention layer encode the topical structure. In the former, this manifests as a higher average inner product between the embeddings of same-topic words; in the latter, as a higher average pairwise attention between same-topic words. The mathematical results involve several assumptions made to keep the analysis tractable, which we verify on data and which may be of independent interest as well.
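As a minimal formalization of the two claims (the notation here is illustrative and not fixed by the abstract), write $e_w$ for the learned embedding of word $w$ and $t(w)$ for its topic under the LDA model. The embedding-layer statement "higher average inner product of embeddings between same-topic words" then reads

$$
\frac{1}{|P_{\mathrm{same}}|}\sum_{(w,w')\in P_{\mathrm{same}}} \langle e_w, e_{w'}\rangle
\;>\;
\frac{1}{|P_{\mathrm{diff}}|}\sum_{(w,w')\in P_{\mathrm{diff}}} \langle e_w, e_{w'}\rangle,
$$

where $P_{\mathrm{same}}$ and $P_{\mathrm{diff}}$ denote the sets of distinct word pairs with equal and unequal topic labels, respectively. The attention-layer statement is the analogous inequality with the average attention weight between the two words in place of the inner product $\langle e_w, e_{w'}\rangle$.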