State-of-the-art pretrained contextualized models (PCMs), e.g., BERT, use tasks such as WiC and WSD to evaluate their word-in-context representations. This inherently assumes that performance on these tasks reflects how well a model represents the coupled word and context semantics. We question this assumption by presenting the first quantitative analysis of the context-word interaction actually being tested in major contextual lexical semantic tasks. To achieve this, we run probing baselines on masked input and propose measures to calculate and visualize the degree of context or word bias in existing datasets. We perform the analysis on both models and humans. Our findings demonstrate that models are usually not tested for word-in-context semantics in the same way humans are on these tasks, which helps us better understand the model-human gap. Specifically, for PCMs, most existing datasets fall at the extreme ends: retrieval-based tasks exhibit a strong target-word bias, while WiC-style tasks and WSD show a strong context bias. In comparison, humans are less biased and achieve much better performance when both the word and the context are available than with masked input. We recommend our framework for understanding and controlling these biases in model interpretation and future task design.
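Below is a minimal sketch (not the authors' released code) of the masked-input probing idea described above: compare a probe's performance when it sees the full word-in-context input, only the target word (context masked), or only the context (target word masked). The normalization and the example accuracy numbers are illustrative assumptions, not results from the paper.

```python
def bias_scores(acc_full: float, acc_word_only: float, acc_context_only: float,
                acc_chance: float) -> dict:
    """Normalize masked-input accuracies by how much of the full-input
    performance (above chance) each partial input already recovers.
    Values near 1.0 mean the partial input alone nearly matches full input."""
    headroom = max(acc_full - acc_chance, 1e-9)
    return {
        "word_bias": (acc_word_only - acc_chance) / headroom,
        "context_bias": (acc_context_only - acc_chance) / headroom,
    }

# Hypothetical WiC-style probing results (placeholders, not reported numbers):
print(bias_scores(acc_full=0.70, acc_word_only=0.52,
                  acc_context_only=0.68, acc_chance=0.50))
# -> context_bias close to 1.0 and word_bias near 0, i.e. the task would be
#    considered context-biased for the probed model.
```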