Untangling tradeoffs between recurrence and self-attention in neural networks

Attention and self-attention mechanisms are now central to state-of-the-art deep learning on sequential tasks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and relies on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies by establishing concrete bounds for gradient norms. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources that arise when balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve the scalability of attentive networks.
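The abstract does not spell out the screening rule, so the following is only a minimal sketch of the general idea of combining recurrence with sparse self-attention over a bounded set of past hidden states. Everything named here is an assumption made for illustration and is not taken from the paper: the function `attentive_rnn`, the parameters `memory_size` and `recent_size`, and the use of accumulated attention weight as a stand-in relevance score.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentive_rnn(inputs, W_h, W_x, W_c, memory_size, recent_size):
    """Tanh RNN whose update also attends over a bounded set of past states.

    Hypothetical screening rule (an assumption, not the paper's): the last
    `recent_size` states are always kept; older states compete for
    `memory_size` slots, ranked by the attention weight they have accumulated.
    """
    hidden = W_h.shape[0]
    h = np.zeros(hidden)
    recent, memory = [], []          # lists of (state, relevance score)
    max_attended = 0
    for x in inputs:
        candidates = memory + recent
        if candidates:
            states = np.stack([s for s, _ in candidates])
            weights = softmax(states @ h / np.sqrt(hidden))
            context = weights @ states
            # Accumulate attention weight as the relevance score.
            scored = [(s, r + w) for (s, r), w in zip(candidates, weights)]
            memory, recent = scored[:len(memory)], scored[len(memory):]
        else:
            context = np.zeros(hidden)
        h = np.tanh(W_h @ h + W_x @ x + W_c @ context)
        recent.append((h, 0.0))
        if len(recent) > recent_size:
            # The oldest recent state is screened for long-term storage.
            memory.append(recent.pop(0))
            memory.sort(key=lambda pair: pair[1], reverse=True)
            memory = memory[:memory_size]
        max_attended = max(max_attended, len(memory) + len(recent))
    return h, max_attended

# Toy run with random weights; sizes are arbitrary.
rng = np.random.default_rng(0)
T, d_in, d_h = 50, 3, 8
inputs = rng.normal(size=(T, d_in))
W_h = rng.normal(scale=0.3, size=(d_h, d_h))
W_x = rng.normal(scale=0.3, size=(d_h, d_in))
W_c = rng.normal(scale=0.3, size=(d_h, d_h))
h, max_attended = attentive_rnn(inputs, W_h, W_x, W_c,
                                memory_size=4, recent_size=5)
```

In this sketch the number of stored states never exceeds `memory_size + recent_size`, so the per-step attention cost stays constant in the sequence length, whereas attending over every past state would grow linearly per step. How such a bound trades off against gradient propagation is the subject of the paper's analysis and is not reproduced here.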


Related content

Attention mechanism


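The attention mechanism was first proposed in the field of computer vision, but it really took off with the Google DeepMind team's paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly on a machine translation task; their work is regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the general trend of research progress on attention.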
