State-of-the-art contextualized models such as BERT are evaluated on tasks like WiC and WSD to assess their word-in-context representations. This inherently assumes that performance on these tasks reflects how well a model represents the coupled word and context semantics. We question this assumption by presenting the first quantitative analysis of the context–word interaction required and tested in major contextual lexical semantic tasks, taking into account that tasks can be inherently biased and that models can learn spurious correlations from datasets. To achieve this, we run probing baselines on masked input, based on which we then propose measures to calculate the degree of context or target word bias in a dataset, and place existing datasets on a continuum. The analysis was performed on both models and humans to decouple biases inherent to the tasks from biases learned from the datasets. We found that (1) for models, most existing datasets fall at the extreme ends of the continuum: the retrieval-based tasks, especially those in the medical domain (e.g., COMETA), exhibit strong target word bias, while WiC-style tasks and WSD show strong context bias; (2) AM2iCo and Sense Retrieval show less extreme model biases and challenge a model more to represent both the context and the target word; (3) a similar trend of biases exists in humans, but humans are much less biased than models: they found semantic judgments far more difficult with masked input, indicating that models learn spurious correlations. This study demonstrates that, under heavy context or target word bias, models in these tasks are usually not being tested on word-in-context representations as such, and results are therefore open to misinterpretation. We recommend our framework as a sanity check for context and target word biases in future task design and model interpretation in lexical semantics.
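To make the masking-based probing idea concrete, here is a minimal sketch of how such bias measures could be computed. The function name `bias_scores`, the accuracy inputs, and the chance-level normalization are illustrative assumptions, not the paper's exact formulas: the idea is to compare performance on the full input against two masked baselines (target word masked vs. context masked).

```python
def bias_scores(acc_full, acc_context_only, acc_word_only, acc_chance):
    """Return illustrative (context_bias, word_bias) scores in [0, 1].

    acc_full         -- accuracy with both target word and context visible
    acc_context_only -- accuracy with the target word masked out
    acc_word_only    -- accuracy with the context masked out
    acc_chance       -- chance-level accuracy for the task
    """
    headroom = acc_full - acc_chance
    if headroom <= 0:
        # Model is at or below chance on the full input; biases undefined.
        return 0.0, 0.0
    # How much of the above-chance performance survives each masking:
    # high context_bias means the target word was barely needed, and
    # high word_bias means the context was barely needed.
    context_bias = max(acc_context_only - acc_chance, 0.0) / headroom
    word_bias = max(acc_word_only - acc_chance, 0.0) / headroom
    return context_bias, word_bias

# Hypothetical numbers for a WiC-style dataset: masking the target word
# barely hurts performance, signalling a strong context bias.
print(bias_scores(acc_full=0.70, acc_context_only=0.67,
                  acc_word_only=0.52, acc_chance=0.50))
```

Under this sketch, a dataset sits at one extreme of the continuum when one masked baseline recovers nearly all of the full-input performance; datasets like AM2iCo and Sense Retrieval would show both scores well below 1, requiring the model to use word and context jointly.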