Multi-headed attention is a mainstay of transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens that have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence), and delimiter (the special [CLS], [SEP] tokens). There are two main challenges with existing classification methods: (a) there is no standard score across studies or across functional roles, and (b) these scores are often averages over sentences that do not capture statistical significance. In this work, we formalize a simple yet effective score that generalizes to all functional roles of attention heads and employ hypothesis testing on this score for robust inference. This provides us with the right lens to systematically analyze attention heads and to comment with confidence on many commonly posed questions about the BERT model. In particular, we comment on the co-location of multiple functional roles in the same attention head, the distribution of attention heads across layers, and the effect of fine-tuning for specific NLP tasks on these functional roles.
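To make the idea concrete, the sketch below shows one plausible instantiation of a role score with a significance test, under assumptions of our own: the score is the mean attention mass a head places on tokens satisfying a role predicate (here a hypothetical "local" window of ±1 tokens), and significance is assessed with a one-sided permutation test against scores from column-shuffled attention. The function names (`role_score`, `permutation_test`) and the exact score and null model are illustrative choices, not necessarily the formulation used in the paper.

```python
# Hedged sketch (not the paper's exact formulation): a per-head "role score"
# plus a permutation test across sentences against a shuffled-attention baseline.
import numpy as np


def role_score(attention, role_mask):
    """attention: (seq_len, seq_len) row-stochastic matrix for one head.
    role_mask: (seq_len, seq_len) boolean; role_mask[i, j] is True if token j
    stands in the role's relation to token i (e.g., same sentence, nearby).
    Returns the mean attention mass falling on role-satisfying positions."""
    return float((attention * role_mask).sum(axis=-1).mean())


def permutation_test(head_scores, baseline_scores, n_perm=10_000, seed=0):
    """One-sided permutation test: is the head's mean score over sentences
    significantly larger than the baseline? Returns an approximate p-value."""
    rng = np.random.default_rng(seed)
    head_scores = np.asarray(head_scores, dtype=float)
    baseline_scores = np.asarray(baseline_scores, dtype=float)
    observed = head_scores.mean() - baseline_scores.mean()
    pooled = np.concatenate([head_scores, baseline_scores])
    n = len(head_scores)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if pooled[:n].mean() - pooled[n:].mean() >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy data: 50 "sentences" of length 12, with a hypothetical "local" role
    # mask marking positions within a window of +/-1 of the query token.
    seq_len, n_sent = 12, 50
    idx = np.arange(seq_len)
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= 1

    head_scores, baseline_scores = [], []
    for _ in range(n_sent):
        logits = rng.normal(size=(seq_len, seq_len))
        logits += 3.0 * local_mask              # simulate a locally attending head
        attn = np.exp(logits)
        attn /= attn.sum(axis=-1, keepdims=True)
        head_scores.append(role_score(attn, local_mask))

        shuffled = attn[:, rng.permutation(seq_len)]  # destroy the local structure
        baseline_scores.append(role_score(shuffled, local_mask))

    p = permutation_test(head_scores, baseline_scores)
    print(f"mean role score = {np.mean(head_scores):.3f}, p-value = {p:.4f}")
```

One appeal of pairing a single score with a hypothesis test, as the abstract argues, is that the same machinery applies to every functional role: only the role mask changes, while the significance criterion stays fixed across studies and roles.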