The attention matrix of a transformer self-attention sublayer can provably be decomposed into two components, of which only one (effective attention) contributes to the model output. This leads us to ask whether visualizing effective attention yields different conclusions than interpreting standard attention. Using BERT and a subset of the GLUE tasks, we carry out an analysis comparing the two attention matrices and show that their interpretations differ. Effective attention is less associated with features related to language-modeling pretraining, such as the separator token, and it has more potential to illustrate the linguistic features the model captures for solving the end task. Given these differences, we recommend using effective attention for studying a transformer's behavior, since it is by design more pertinent to the model output.
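To make the decomposition concrete, below is a minimal NumPy/SciPy sketch of one common formulation (following Brunner et al., 2020), not the paper's own code: any attention mass lying in the left null space of the value matrix V is provably cancelled in the product AV, so subtracting that projection from each row of A yields an effective component A_eff with A_eff V = AV. The function name and toy dimensions are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import null_space

def effective_attention(A, V):
    """Split an attention matrix A (n x n) into the component that
    contributes to the sublayer output A @ V and the component lying
    in the left null space of the value matrix V (n x d_v), which is
    cancelled. Returns A_eff, satisfying A_eff @ V == A @ V."""
    # Orthonormal basis U of the left null space of V: u.T @ V == 0
    # for every column u, so attention mass along U has no effect.
    U = null_space(V.T)            # shape (n, k), k = n - rank(V)
    if U.size == 0:                # V has full row rank:
        return A                   # all of A is already effective
    # Project each row of A off the null space: A_eff = A (I - U U^T).
    A_null = (A @ U) @ U.T         # non-contributing component
    return A - A_null

# Toy check: attention over n = 4 tokens with value dimension 2, so the
# left null space of V is non-trivial and the two matrices differ.
rng = np.random.default_rng(0)
A = rng.random((4, 4))
A /= A.sum(axis=1, keepdims=True)  # row-stochastic, like softmax output
V = rng.standard_normal((4, 2))
A_eff = effective_attention(A, V)
assert np.allclose(A @ V, A_eff @ V)  # identical sublayer output
```

Under this formulation, visualizing A_eff instead of A discards exactly the attention weights that cannot influence the output, which is the motivation for the recommendation above.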