Entity typing aims to assign types to the entity mentions in a given text. The traditional classification-based entity typing paradigm has two notable drawbacks: 1) it cannot assign an entity to types beyond the predefined type set, and 2) it can hardly handle few-shot and zero-shot situations, where many long-tail types have few or even no training instances. To overcome these drawbacks, we propose a novel generative entity typing (GET) paradigm: given a text with an entity mention, a pre-trained language model (PLM) generates the multiple types for the role that the entity plays in the text. However, PLMs tend to generate coarse-grained types after fine-tuning on the entity typing dataset. Moreover, the available training data is heterogeneous, consisting of a small portion of human-annotated data and a large portion of auto-generated but low-quality data. To tackle these problems, we employ curriculum learning (CL) to train our GET model on the heterogeneous data, where the curriculum is self-adjusted through self-paced learning according to the model's comprehension of type granularity and data heterogeneity. Our extensive experiments on datasets of different languages and downstream tasks demonstrate the superiority of our GET model over state-of-the-art entity typing models. The code has been released at https://github.com/siyuyuan/GET.
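The self-paced curriculum mentioned in the abstract can be illustrated with a minimal sketch of generic hard-threshold self-paced learning. This is not the authors' implementation (see the released repository for that): the function names, the fixed per-sample losses, and the multiplicative threshold growth are all assumptions made for illustration. In practice the losses would be recomputed by the model after each training round.

```python
from typing import List


def self_paced_weights(losses: List[float], threshold: float) -> List[int]:
    """Hard self-paced weighting: a sample is admitted (weight 1) only if
    its current loss is below the threshold, otherwise it is skipped (0)."""
    return [1 if loss < threshold else 0 for loss in losses]


def curriculum_schedule(
    losses: List[float],
    start_threshold: float,
    growth: float,
    num_rounds: int,
) -> List[List[int]]:
    """Return, for each round, the indices of samples admitted to training.

    Assumption: losses stay fixed across rounds to keep the sketch
    self-contained; a real trainer would re-evaluate them every round.
    """
    schedule = []
    threshold = start_threshold
    for _ in range(num_rounds):
        weights = self_paced_weights(losses, threshold)
        schedule.append([i for i, w in enumerate(weights) if w == 1])
        # Raise the threshold so that harder samples enter later rounds.
        threshold *= growth
    return schedule


if __name__ == "__main__":
    # Hypothetical per-sample losses: low = easy, high = hard or noisy.
    example_losses = [0.2, 0.9, 1.5, 3.0]
    for round_idx, admitted in enumerate(
        curriculum_schedule(example_losses, 0.5, 2.0, 3), start=1
    ):
        print(f"round {round_idx}: train on samples {admitted}")
```

Under these assumptions, easy samples are trained on first, and harder or noisier samples are admitted only as the threshold rises. This is how a self-paced curriculum can defer low-quality auto-generated data until the model is more reliable.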