The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
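To make the described replacement concrete, below is a minimal sketch (not the authors' released implementation) of what swapping input-dependent attention for constant, averaged attention could look like in PyTorch. The class and function names (`ConstantAttention`, `average_attention`, `model_attn_fn`) are hypothetical, and padding/length handling is simplified.

```python
import torch


class ConstantAttention(torch.nn.Module):
    """Sketch of constant (input-independent) attention.

    The input-dependent weights softmax(QK^T / sqrt(d)) are replaced with a
    fixed per-head matrix: attention weights averaged over a sample of inputs.
    """

    def __init__(self, avg_weights: torch.Tensor):
        # avg_weights: (num_heads, max_len, max_len), precomputed offline by
        # averaging each head's attention matrices over a corpus of inputs.
        super().__init__()
        self.register_buffer("avg_weights", avg_weights)

    def forward(self, value: torch.Tensor) -> torch.Tensor:
        # value: (batch, num_heads, seq_len, head_dim)
        seq_len = value.size(2)
        # Crop the constant weights to the current sequence length and
        # renormalize rows so they still sum to one.
        w = self.avg_weights[:, :seq_len, :seq_len]
        w = w / w.sum(dim=-1, keepdim=True)
        # Usual weighted sum over values, but the weights ignore the input.
        return torch.einsum("hqk,bhkd->bhqd", w, value)


def average_attention(model_attn_fn, inputs, num_heads, max_len):
    """Estimate per-head average attention weights over a sample of inputs.

    model_attn_fn is assumed to return (num_heads, seq_len, seq_len) attention
    probabilities for a single input.
    """
    total = torch.zeros(num_heads, max_len, max_len)
    count = torch.zeros(max_len, max_len)
    for x in inputs:
        attn = model_attn_fn(x)              # (num_heads, L, L)
        L = attn.size(-1)
        total[:, :L, :L] += attn
        count[:L, :L] += 1
    return total / count.clamp(min=1)
```

In use, one would precompute `average_attention(...)` on held-out data and substitute a `ConstantAttention` module for the attention-weight computation in some or all heads, then measure the downstream performance drop relative to the unmodified model.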