How does word frequency in pre-training data affect the behavior of similarity metrics in contextualized BERT embeddings? Are there systematic ways in which some word relationships are exaggerated or understated? In this work, we explore the geometric characteristics of contextualized word embeddings with two novel tools: (1) an identity probe that predicts the identity of a word using its embedding; (2) the minimal bounding sphere for a word's contextualized representations. Our results reveal that words of high and low frequency differ significantly with respect to their representational geometry. Such differences introduce distortions: when compared to human judgments, point estimates of embedding similarity (e.g., cosine similarity) can over- or under-estimate the semantic similarity of two words, depending on the frequency of those words in the training data. This has downstream societal implications: BERT-Base has more trouble differentiating between South American and African countries than North American and European ones. We find that these distortions persist in BERT-Multilingual, suggesting that they cannot be easily fixed with additional data alone; the multilingual model in turn introduces new distortions of its own.
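Since the abstract centers on two concrete measurements, point estimates of cosine similarity between contextualized embeddings and a bounding sphere over a word's representations, a minimal sketch of both may be useful. This is not the authors' code: the model choice (bert-base-uncased via HuggingFace transformers), the mean-pooling over subword tokens, the example sentences, and the use of Ritter's approximate bounding sphere (rather than an exact minimal bounding sphere) are all assumptions made for illustration.

```python
# Minimal sketch: cosine similarity between contextualized BERT embeddings
# of two words, and an *approximate* bounding sphere for a set of
# contextualized representations. All specifics below are illustrative
# assumptions, not the paper's implementation.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def word_embedding(sentence: str, word: str) -> np.ndarray:
    """Mean-pool the last-layer vectors of the subword tokens spanning `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    # Locate the word's subword span in the sentence (first occurrence).
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i : i + len(word_ids)] == word_ids:
            return hidden[i : i + len(word_ids)].mean(dim=0).numpy()
    raise ValueError(f"{word!r} not found in {sentence!r}")


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Point estimate of embedding similarity, as discussed in the abstract."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def ritter_bounding_sphere(points: np.ndarray) -> tuple[np.ndarray, float]:
    """Ritter's approximate bounding sphere over rows of `points`.

    A cheap stand-in for the exact minimal bounding sphere the abstract
    describes; Ritter's result is typically within a few percent of minimal.
    """
    p = points[0]
    q = points[np.argmax(np.linalg.norm(points - p, axis=1))]  # far from p
    r = points[np.argmax(np.linalg.norm(points - q, axis=1))]  # far from q
    center, radius = (q + r) / 2.0, np.linalg.norm(q - r) / 2.0
    for x in points:  # grow the sphere to cover any point still outside it
        d = np.linalg.norm(x - center)
        if d > radius:
            radius = (radius + d) / 2.0
            center = center + ((d - radius) / d) * (x - center)
    return center, radius


if __name__ == "__main__":
    # Example sentences are assumptions, chosen to echo the abstract's
    # country-name comparison.
    u = word_embedding("She traveled through Peru last spring.", "Peru")
    v = word_embedding("She traveled through Chile last spring.", "Chile")
    print("cosine(Peru, Chile) =", cosine(u, v))

    contexts = [
        "Peru borders the Pacific Ocean.",
        "The capital of Peru is Lima.",
        "Peru exports copper and silver.",
    ]
    reps = np.stack([word_embedding(s, "Peru") for s in contexts])
    _, radius = ritter_bounding_sphere(reps)
    print("approx. bounding-sphere radius for 'Peru':", radius)
```

Under these assumptions, comparing such cosine values and sphere radii across high-frequency words (e.g., North American and European country names) and low-frequency ones (e.g., South American and African country names) is the kind of measurement the abstract's frequency-based distortion claim rests on.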