Entity Set Expansion (ESE) is a critical task that aims to expand a small seed entity set with further entities of the target semantic class the seeds describe. Most existing ESE methods are retrieval-based frameworks that must extract contextual features of entities and compute the similarity between seed entities and candidate entities. To do both, they have to iteratively traverse the corpus and the entity vocabulary provided with the datasets, which results in poor efficiency and scalability. Our experiments indicate that the time consumed by retrieval-based ESE methods grows linearly with entity vocabulary and corpus size. In this paper, we propose the first generative ESE framework, Generative Entity Set Expansion (GenExpan), which uses a generative pre-trained language model to accomplish the ESE task. Specifically, a prefix tree is employed to guarantee that every generated entity is valid, and automatically generated class names are adopted to guide the model toward generating target entities. Moreover, we propose Knowledge Calibration and Generative Ranking to further bridge the gap between the generic knowledge of the language model and the goal of the ESE task. Experiments on publicly available datasets show that GenExpan is both efficient and effective. In terms of efficiency, the expansion time consumed by GenExpan is independent of entity vocabulary and corpus size, and GenExpan achieves an average 600% speedup compared to strong baselines. In terms of expansion performance, our framework outperforms previous state-of-the-art ESE methods.
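To illustrate how a prefix tree can guarantee valid entity generation, the following is a minimal sketch rather than the GenExpan implementation. It assumes entities are already split into tokens, uses a hypothetical `<end>` marker to close an entity, and stands in for the language model with a caller-supplied `score` function; greedy decoding is used here only for brevity.

```python
from typing import Callable, Dict, List, Optional, Sequence


class PrefixTree:
    """Trie over token sequences; each root-to-END path spells one entity."""

    END = "<end>"  # hypothetical end-of-entity marker (an assumption)

    def __init__(self, entities: Sequence[Sequence[str]]) -> None:
        self.root: Dict[str, dict] = {}
        for tokens in entities:
            node = self.root
            for tok in tokens:
                node = node.setdefault(tok, {})
            node[self.END] = {}  # mark that a complete entity ends here

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that keep `prefix` on a path to some known entity."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # prefix is not part of any entity
            node = node[tok]
        return sorted(node.keys())


def constrained_greedy(
    tree: PrefixTree,
    score: Callable[[Sequence[str], str], float],
    max_len: int = 8,
) -> Optional[List[str]]:
    """Greedy decoding restricted to the tree.

    `score(prefix, token)` is a stand-in for a language model's
    next-token score. Returns None if no entity is completed.
    """
    prefix: List[str] = []
    for _ in range(max_len):
        options = tree.allowed_next(prefix)
        if not options:
            return None
        best = max(options, key=lambda tok: score(prefix, tok))
        if best == PrefixTree.END:
            return prefix  # a complete, valid entity
        prefix.append(best)
    return None  # ran out of steps before closing an entity
```

Because the decoder only ever chooses among `allowed_next(prefix)`, the per-step cost depends on the number of children at the current node, and any returned sequence is an entity that was inserted into the tree.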