Most words are ambiguous--i.e., they convey distinct meanings in different contexts--and even the meanings of unambiguous words are context-dependent. Both phenomena present a challenge for NLP. Recently, the advent of contextualized word embeddings has led to success on tasks involving lexical ambiguity, such as Word Sense Disambiguation. However, there are few tasks that directly evaluate how well these contextualized embeddings accommodate the more continuous, dynamic nature of word meaning--particularly in a way that matches human intuitions. We introduce RAW-C, a dataset of graded, human relatedness judgments for 112 ambiguous words in context (with 672 sentence pairs total), as well as human estimates of sense dominance. The average inter-annotator agreement (assessed using a leave-one-annotator-out method) was 0.79. We then show that a measure of cosine distance, computed using contextualized embeddings from BERT and ELMo, correlates with human judgments, but that cosine distance also systematically underestimates how similar humans find uses of the same sense of a word to be, and systematically overestimates how similar humans find uses of different-sense homonyms. Finally, we propose a synthesis between psycholinguistic theories of the mental lexicon and computational models of lexical semantics.