Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is the quadratic memory and time complexity $\mathcal{O}(L^2)$ of their attention layers with respect to the sequence length $L$, which restricts their application to extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions on the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and to approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention or through indirect attention to abstractions, which are again conditional expectations of embeddings from the corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention, resulting in the same sub-quadratic cost ($\mathcal{O}(L\log(L))$ or $\mathcal{O}(L\sqrt{L})$). Combiner is a drop-in replacement for the attention layers of existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
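As a rough illustration of the conditional-expectation view described above (a sketch only; the query/key/value embeddings $q_i, k_j, v_j$, the head dimension $d$, and the partition of locations into regions $\{\Omega_r\}$ are notational assumptions not defined in this abstract), standard attention at location $i$ can be written as
$$
A(x_i) \;=\; \sum_{j=1}^{L} p(j \mid i)\, v_j \;=\; \mathbb{E}_{j \sim p(\cdot \mid i)}\!\left[v_j\right],
\qquad
p(j \mid i) \;=\; \frac{\exp\!\left(q_i^{\top} k_j / \sqrt{d}\right)}{\sum_{j'=1}^{L} \exp\!\left(q_i^{\top} k_{j'} / \sqrt{d}\right)},
$$
and the structured factorization replaces $p(j \mid i)$ with a form roughly like $p(\Omega_{r(j)} \mid i)\, p(j \mid \Omega_{r(j)}, i)$, where $\Omega_{r(j)}$ is the region containing $j$: locations in the local region are attended to directly, while other regions are reached indirectly through their abstractions, so every location $j$ retains non-zero attention probability at sub-quadratic cost.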