Previous work on Entity Linking has focused on resources targeting non-nested proper named entity mentions, often in data from Wikipedia, i.e., Wikification. In this paper, we present and evaluate WikiGUM, a fully wikified dataset covering all mentions of named entities, including their non-named and pronominal mentions, as well as mentions nested within other mentions. The dataset covers 12 written and spoken genres, most of which have not been included in Entity Linking efforts to date, leading to poor performance by a pretrained SOTA system in our evaluation. The availability of a variety of other annotations for the same data also enables further research on entities in context.