Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized (i.e., how the theoretical concept of coreference is operationalized in the dataset) due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This breakdown helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models; moreover, when developing new models, future work can explicitly account for those types of coreference that are empirically associated with poor generalization.
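As a rough illustration of the per-type breakdown described above, the sketch below computes link-level recall separately for each coreference type. The function name, data shapes, and category labels (e.g., "generic", "copula", "compound") are hypothetical choices for this sketch, not the paper's actual annotation scheme or evaluation code.

```python
from collections import defaultdict

def recall_by_type(gold_links, predicted_links):
    """Break down link-level recall by coreference type.

    gold_links: iterable of (mention_pair, coref_type) tuples, where a
        mention_pair identifies two gold-coreferent mention spans.
    predicted_links: set of mention_pair tuples produced by a model.
    Returns a dict mapping each coreference type to its recall.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for pair, coref_type in gold_links:
        total[coref_type] += 1
        if pair in predicted_links:
            found[coref_type] += 1
    return {t: found[t] / total[t] for t in total}

# Toy usage: mention spans are (doc_id, start, end) triples (an assumption).
gold = [
    ((("doc1", 0, 2), ("doc1", 10, 11)), "generic"),
    ((("doc1", 4, 5), ("doc1", 20, 21)), "copula"),
    ((("doc1", 7, 8), ("doc1", 30, 31)), "compound"),
]
pred = {(("doc1", 0, 2), ("doc1", 10, 11))}
print(recall_by_type(gold, pred))
# {'generic': 1.0, 'copula': 0.0, 'compound': 0.0}
```

A per-type view like this surfaces failures that an aggregate score hides: a model can score well overall while systematically missing, say, copula predicates.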