Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.
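As a minimal sketch (not the authors' code) of the quantity studied here, the snippet below computes the cosine similarity between contextual BERT embeddings of the same word in two different contexts. The model name (`bert-base-uncased`), the use of the final hidden layer, and the assumption that the target word is a single WordPiece token are illustrative choices, not details from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; the paper's experiments may use different settings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Final-layer embedding of `word` in `sentence` (assumes `word` is one WordPiece)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    target_id = tokenizer.convert_tokens_to_ids(word)
    pos = (enc["input_ids"][0] == target_id).nonzero(as_tuple=True)[0][0]
    return hidden[pos]

# Same word ("river") in two contexts; cosine over the two contextual embeddings.
e1 = word_embedding("they walked along the river bank", "river")
e2 = word_embedding("the river flooded after the storm", "river")
cos = torch.nn.functional.cosine_similarity(e1, e2, dim=0)
print(f"cosine similarity: {cos.item():.3f}")
```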