Entity typing aims to predict one or more words that describe the type(s) of a specific mention in a sentence. Because of shortcuts from surface patterns to annotated entity labels and biased training, existing entity typing models are prone to spurious correlations. To comprehensively investigate the faithfulness and reliability of entity typing methods, we first systematically define distinct kinds of model biases that stem mainly from spurious correlations. In particular, we identify six types of existing model biases: mention-context bias, lexical overlapping bias, named entity bias, pronoun bias, dependency bias, and overgeneralization bias. To mitigate these biases, we then introduce a counterfactual data augmentation method. Augmenting the original training set with its debiased counterpart forces models to fully comprehend sentences and discover the fundamental cues for entity typing, rather than relying on spurious correlations as shortcuts. Experimental results on the UFET dataset show that our counterfactual data augmentation approach improves the generalization of different entity typing models, with consistently better performance on both the original and debiased test sets.
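The general idea of counterfactual data augmentation can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Example` record, the `mask_mention` perturbation, and the `[MASK]` placeholder are assumptions chosen for illustration. The sketch also assumes that labels can be kept unchanged after masking, which may not hold when a label is supported only by the mention itself.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Example:
    """One entity typing instance (hypothetical schema)."""
    left: str                 # context to the left of the mention
    mention: str              # the mention span to be typed
    right: str                # context to the right of the mention
    labels: FrozenSet[str]    # free-form type labels


# A perturbation maps an example to a counterfactual one, or None to skip it.
Perturbation = Callable[[Example], Optional[Example]]


def mask_mention(ex: Example, mask_token: str = "[MASK]") -> Optional[Example]:
    """Hide the mention so that only the context can support the labels.

    Assumption: labels are kept as-is; in practice, labels that the
    context alone cannot justify might need to be filtered out.
    """
    if ex.mention == mask_token:
        return None  # already masked, nothing to add
    return replace(ex, mention=mask_token)


def augment(train: List[Example],
            perturbations: List[Perturbation]) -> List[Example]:
    """Return the original examples followed by their counterfactual copies."""
    augmented = list(train)
    for ex in train:
        for perturb in perturbations:
            counterfactual = perturb(ex)
            if counterfactual is not None:
                augmented.append(counterfactual)
    return augmented


if __name__ == "__main__":
    train = [
        Example("Last night ", "Messi", " scored twice for the club.",
                frozenset({"person", "athlete"})),
    ]
    for ex in augment(train, [mask_mention]):
        print(ex.left + ex.mention + ex.right, sorted(ex.labels))
```

Each bias type would need its own perturbation function; passing them as a list keeps the augmentation loop independent of how any single counterfactual is built.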