Short textual descriptions of entities provide summaries of their key attributes and have been shown to be useful sources of background knowledge for tasks such as entity linking and question answering. However, generating entity descriptions, especially for new and long-tail entities, can be challenging since relevant information is often scattered across multiple sources with varied content and style. We introduce DESCGEN: given mentions spread over multiple documents, the goal is to generate an entity summary description. DESCGEN consists of 37K entity descriptions from Wikipedia and Fandom, each paired with nine evidence documents on average. The documents were collected using a combination of entity linking and hyperlinks to the Wikipedia and Fandom entity pages, which together provide high-quality distant supervision. The resulting summaries are more abstractive than those found in existing datasets and provide a better proxy for the challenge of describing new and emerging entities. We also propose a two-stage extract-then-generate baseline and show that there exists a large gap (19.9% in ROUGE-L) between state-of-the-art models and human performance, suggesting that the data will support significant future work.
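To make the two-stage extract-then-generate baseline concrete, the following is a minimal sketch of such a pipeline. The word-overlap salience scoring, the helper names, and the placeholder abstractive model are illustrative assumptions, not the paper's actual implementation; in practice the second stage would be a pretrained abstractive summarizer fine-tuned on the dataset.

```python
# Minimal sketch of a two-stage extract-then-generate pipeline for entity
# description generation. The extraction heuristic (lexical overlap with the
# entity mention) and all names here are assumptions for illustration only.

from collections import Counter
from typing import Callable, List


def extract_salient_sentences(entity: str, documents: List[str], k: int = 10) -> List[str]:
    """Stage 1: rank sentences from all evidence documents by lexical
    overlap with the entity mention and keep the top-k."""
    entity_tokens = set(entity.lower().split())
    sentences = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]

    def score(sentence: str) -> int:
        tokens = Counter(sentence.lower().split())
        return sum(tokens[t] for t in entity_tokens)

    return sorted(sentences, key=score, reverse=True)[:k]


def generate_description(entity: str, documents: List[str],
                         abstractive_model: Callable[[str], str], k: int = 10) -> str:
    """Stage 2: pass the concatenated extracted evidence to an abstractive
    summarizer to produce the final entity description."""
    evidence = " ".join(extract_salient_sentences(entity, documents, k))
    return abstractive_model(f"{entity}: {evidence}")


if __name__ == "__main__":
    # Hypothetical evidence documents for a made-up entity.
    docs = [
        "Foo Corp was founded in 2019 and builds warehouse robots. "
        "The company is headquartered in Seattle.",
        "Foo Corp acquired Bar Labs last year. An unrelated sentence about weather.",
    ]
    # Placeholder "model" that simply truncates its input; a real system would
    # use a fine-tuned seq2seq model here.
    print(generate_description("Foo Corp", docs, lambda text: text[:120]))
```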