Recently, contextualized word embeddings have outperformed static word embeddings on many NLP tasks. However, we still know little about the mechanisms inside these representations. Do they have any common patterns? If so, where do these patterns come from? We find that almost all of the contextualized word vectors of BERT and RoBERTa share a common pattern. For BERT, the $557^{th}$ element is always the smallest. For RoBERTa, the $588^{th}$ element is always the largest and the $77^{th}$ element is always the smallest. We call these elements the "tails" of the models. We introduce a new neuron-level method to analyze where these "tails" come from, and find that they are closely related to positional information. We also investigate what happens if we "cut the tails" (i.e., zero them out). Our results show that the "tails" are the major cause of the anisotropy of the vector space. After "cutting the tails", the different contextualized vectors of a word become more similar to each other, and the internal representations better distinguish a word's different senses on the Word-in-Context (WiC) dataset. Performance on the word sense disambiguation task improves for BERT and is unchanged for RoBERTa. We can also better induce phrase grammar from the vector space. These results suggest that the "tails" are less related to the sense and syntax information in the vectors. These findings provide insights into the inner workings of contextualized word vectors.
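As a minimal sketch (not the authors' code) of the "cutting the tails" operation: the snippet below zeroes out the reported tail dimension of BERT's contextualized vectors and compares the average pairwise cosine similarity, a common proxy for anisotropy, before and after. The checkpoint name, the example sentences, and the assumption that the $557^{th}$ element corresponds to index 557 in 0-based indexing are all illustrative assumptions.

```python
# Sketch: zero out BERT's "tail" dimension and measure anisotropy
# via mean pairwise cosine similarity (an assumed proxy, not the
# paper's exact protocol).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed checkpoint
TAIL_DIM = 557                    # reported "tail" element for BERT (indexing assumed 0-based)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sentences = [
    "The bank raised interest rates.",
    "She sat on the river bank.",
    "A quick brown fox jumps over the lazy dog.",
]

with torch.no_grad():
    enc = tokenizer(sentences, return_tensors="pt", padding=True)
    hidden = model(**enc).last_hidden_state      # (batch, seq, hidden)
    mask = enc["attention_mask"].bool()
    vecs = hidden[mask]                          # all real-token vectors

def avg_cosine(v):
    """Mean pairwise cosine similarity over all token vectors."""
    v = torch.nn.functional.normalize(v, dim=-1)
    sim = v @ v.T
    n = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()

print("anisotropy before cutting:", avg_cosine(vecs))

cut = vecs.clone()
cut[:, TAIL_DIM] = 0.0                           # "cutting the tail"
print("anisotropy after cutting :", avg_cosine(cut))
```

Under the abstract's claim, the second number should drop noticeably, since the shared extreme value in the tail dimension pushes all vectors toward a common direction.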