Combiner: Full Attention Transformer with Sparse Computation Cost

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to the sequence length in attention layers, which restricts application in extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention, or through indirect attention to abstractions, which are again conditional expectations of embeddings from corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention, resulting in the same sub-quadratic cost ($\mathcal{O}(L\log(L))$ or $\mathcal{O}(L\sqrt{L})$). Combiner is a drop-in replacement for attention layers in existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
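The factorization described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a fixed, non-causal block layout, mean pooling for the block summaries, and hypothetical function names (`full_attention`, `combiner_sketch`). Each query attends directly to the tokens in its own block and indirectly to one abstraction per other block, where every abstraction is itself a conditional expectation over that block's values. The sketch also materializes the implied L x L distribution purely for inspection; a real implementation would never build it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Row i of P is the conditional distribution p(j | x_i);
    # the attention output is the expectation of v_j under it.
    P = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return P @ V, P

def combiner_sketch(Q, K, V, block):
    # Assumption: fixed blocks of equal size, bidirectional (no causal mask).
    L, d = Q.shape
    assert L % block == 0
    nb = L // block
    Kb = K.reshape(nb, block, d)
    Vb = V.reshape(nb, block, -1)
    # Block abstractions (assumption: mean pooling for summary keys/queries).
    K_abs = Kb.mean(axis=1)                        # (nb, d)
    Q_abs = Q.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    # p(j | block r): local distribution over the tokens inside each block.
    local = softmax(np.einsum("bd,bjd->bj", Q_abs, Kb) / np.sqrt(d))
    # Abstraction value: conditional expectation of values inside each block.
    V_abs = np.einsum("bj,bjd->bd", local, Vb)     # (nb, dv)

    out = np.zeros((L, V.shape[1]))
    P = np.zeros((L, L))  # implied full distribution, built only for checking
    for i in range(L):
        b = i // block
        lo, hi = b * block, (b + 1) * block
        others = np.array([r for r in range(nb) if r != b], dtype=int)
        # One softmax over direct tokens and the other blocks' abstractions.
        logits = np.concatenate([K[lo:hi] @ Q[i], K_abs[others] @ Q[i]])
        w = softmax(logits / np.sqrt(d))
        w_direct, w_abs = w[:block], w[block:]
        out[i] = w_direct @ V[lo:hi] + w_abs @ V_abs[others]
        P[i, lo:hi] = w_direct
        for w_r, r in zip(w_abs, others):
            # p(j | x_i) = p(block r | x_i) * p(j | block r)
            P[i, r * block:(r + 1) * block] = w_r * local[r]
    return out, P

rng = np.random.default_rng(0)
L, d, block = 16, 4, 4
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out, P = combiner_sketch(Q, K, V, block)
```

In this sketch each query scores `block + (L / block - 1)` entries instead of `L`, so choosing `block` near $\sqrt{L}$ gives roughly $\mathcal{O}(L\sqrt{L})$ total work, while every entry of the implied matrix `P` stays strictly positive, which is the sense in which attention remains "full".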


Related content

Attention mechanism


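The attention mechanism was first proposed in the field of computer vision, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly on a machine translation task; their work is regarded as the first to bring attention to NLP. Attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of progress in attention research.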
