The Quarks of Attention

Attention plays a fundamental role in both natural and artificial intelligence systems. In deep learning, attention-based neural architectures, such as transformer architectures, are widely used to tackle problems in natural language processing and beyond. Here we investigate the fundamental building blocks of attention and their computational properties. Within the standard model of deep learning, we classify all possible fundamental building blocks of attention in terms of their source, target, and computational mechanism. We identify and study three most important mechanisms: additive activation attention, multiplicative output attention (output gating), and multiplicative synaptic attention (synaptic gating). The gating mechanisms correspond to multiplicative extensions of the standard model and are used across all current attention-based deep learning architectures. We study their functional properties and estimate the capacity of several attentional building blocks in the case of linear and polynomial threshold gates. Surprisingly, additive activation attention plays a central role in the proofs of the lower bounds. Attention mechanisms reduce the depth of certain basic circuits and leverage the power of quadratic activations without incurring their full cost.
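The three mechanisms named in the abstract can be illustrated on a single unit of the standard model, where a unit computes an activation S = sum_k w_k x_k and an output f(S). The sketch below is only one plausible reading of the abstract's terminology, not the paper's own code: the function names, the sigmoid transfer function, and the scalar attending signal are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic transfer function (an assumed choice of f)."""
    return 1.0 / (1.0 + math.exp(-x))


def activation(weights, inputs):
    """Activation of a standard-model unit: S = sum_k w_k * x_k."""
    return sum(w * x for w, x in zip(weights, inputs))


def additive_activation_attention(weights, inputs, attn_signal, f=sigmoid):
    """The attending signal is added to the target's activation before f."""
    return f(activation(weights, inputs) + attn_signal)


def output_gating(weights, inputs, gate, f=sigmoid):
    """The target unit's output is multiplied by the attending signal."""
    return f(activation(weights, inputs)) * gate


def synaptic_gating(weights, inputs, gate, gated_index, f=sigmoid):
    """One synaptic weight of the target is multiplied by the attending signal."""
    gated = list(weights)          # copy so the caller's weights are untouched
    gated[gated_index] *= gate
    return f(activation(gated, inputs))


# Hypothetical toy values: the ungated activation is 1.0*0.5 + (-2.0)*0.25 = 0.0
weights = [1.0, -2.0]
inputs = [0.5, 0.25]

plain = sigmoid(activation(weights, inputs))
added = additive_activation_attention(weights, inputs, attn_signal=0.0)
closed = output_gating(weights, inputs, gate=0.0)
pruned = synaptic_gating(weights, inputs, gate=0.0, gated_index=1)
```

With a gate of 0, output gating silences the unit entirely, whereas synaptic gating removes only the contribution of one input, so the unit still responds to the others. Both gating forms multiply signals together, which is why the abstract describes them as multiplicative extensions of the standard model; the additive form stays within it.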

Attention mechanism

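The attention mechanism was first proposed in the field of visual imaging, but it arguably took off with the Google DeepMind team's paper "Recurrent Models of Visual Attention" [14], which applied an attention mechanism to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly on a machine translation task; their work is generally regarded as the first to apply attention to NLP. Attention-based extensions of RNN models then began to be applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of progress in attention research.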
