Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attention and feed-forward sublayers are not equally useful across downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
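The core observation behind this reframing is that residual connections let a layer's output be unrolled into a sum of per-sublayer contributions. The following is a minimal sketch of that idea on a toy single-token layer; the dimension, the linear stand-ins for the attention and feed-forward maps, and the omission of layer normalization are all simplifying assumptions, not the paper's actual derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical embedding dimension

x = rng.normal(size=d)            # input embedding (residual stream)
W_attn = rng.normal(size=(d, d))  # linear stand-in for multi-head attention
W_ffn = rng.normal(size=(d, d))   # linear stand-in for the feed-forward map

# Standard residual composition: each sublayer adds its output to the stream.
h = x + W_attn @ x                # after the attention sublayer
out = h + W_ffn @ h               # after the feed-forward sublayer

# Unrolled view: the same output as a sum of three vector factors,
# one per component (input, attention, feed-forward).
attn_term = W_attn @ x
ffn_term = W_ffn @ (x + attn_term)
assert np.allclose(out, x + attn_term + ffn_term)
```

Once the output is written this way, the relative importance of each additive term can be measured separately per task, which is what motivates the component-level analysis above.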