To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention with a hand-crafted pattern and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data, while the second still incurs quadratic cost during training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Each attention head then data-adaptively samples a bipartite graph, whose adjacency matrix is used as the attention mask for each input. During backpropagation, a straight-through estimator propagates gradients through the discrete sampling step and adjusts the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations on the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation can be found at https://github.com/sc782/SBM-Transformer .
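To make the sampling step concrete, below is a minimal PyTorch sketch of how one attention head might sample its mask from a mixed-membership SBM with a straight-through estimator. The names `q_feat`, `k_feat`, and `block_logits` are hypothetical, and the sketch materializes the dense edge-probability matrix for clarity, which is $O(nm)$; the actual model samples edges directly so that cost stays linear in the number of sampled edges.

```python
import torch
import torch.nn.functional as F

def sbm_attention_mask(q_feat, k_feat, block_logits):
    """Sample a bipartite attention mask from a mixed-membership SBM.

    q_feat:       (n, k) query-side cluster-membership logits (hypothetical)
    k_feat:       (m, k) key-side cluster-membership logits (hypothetical)
    block_logits: (k, k) inter-cluster affinity logits (hypothetical)
    """
    # Mixed memberships: each token holds a distribution over k clusters.
    y = F.softmax(q_feat, dim=-1)       # (n, k)
    z = F.softmax(k_feat, dim=-1)       # (m, k)
    B = torch.sigmoid(block_logits)     # (k, k) block edge probabilities

    # Edge probability between query i and key j: p_ij = y_i^T B z_j.
    # NOTE: forming p densely is O(nm); the paper instead samples edges
    # directly from the SBM to keep cost linear in the edge count.
    p = y @ B @ z.T                     # (n, m), entries in [0, 1]

    # Bernoulli sample of the bipartite adjacency (hard 0/1 mask).
    mask_hard = torch.bernoulli(p)

    # Straight-through estimator: the forward pass uses the hard mask,
    # while the backward pass flows gradients through p.
    mask = mask_hard + p - p.detach()
    return mask
```

In use, the sampled mask would gate the attention score matrix before a masked softmax, so that only the sampled edges contribute to (and receive gradients from) each attention head.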