Existing work on Entity Linking mostly assumes that the reference knowledge base is complete, and therefore that all mentions can be linked. In practice, this is hardly ever the case: knowledge bases are incomplete, and novel concepts arise constantly. We create the Unknown Entity Discovery and Indexing (EDIN) benchmark, in which unknown entities, that is, entities with neither a description in the knowledge base nor labeled mentions, have to be integrated into an existing entity linking system. By contrasting EDIN with zero-shot entity linking, we provide insight into the additional challenges it poses. Building on dense-retrieval-based entity linking, we introduce the end-to-end EDIN pipeline, which detects, clusters, and indexes mentions of unknown entities in context. Experiments show that indexing a single embedding per entity, which unifies the information of multiple mentions, works better than indexing mentions independently.
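The contrast between the two indexing strategies can be illustrated with a minimal sketch. It assumes that mention embeddings and cluster assignments are already available, and it uses mean pooling of normalized mention vectors as a stand-in for "unifying" the mentions of one entity; the abstract does not specify the pooling method, so that choice, along with all function names and the toy vectors, is hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def index_per_mention(mention_vecs, cluster_ids):
    """Baseline: every mention becomes its own index entry."""
    return [(cid, normalize(v)) for v, cid in zip(mention_vecs, cluster_ids)]


def index_per_entity(mention_vecs, cluster_ids):
    """One entry per cluster. Mean pooling is an assumption made here
    for illustration, not necessarily the method used in the paper."""
    groups = defaultdict(list)
    for v, cid in zip(mention_vecs, cluster_ids):
        groups[cid].append(normalize(v))
    index = []
    for cid, vecs in groups.items():
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        index.append((cid, normalize(mean)))
    return index


def link(query_vec, index):
    """Return the id of the index entry with the highest dot-product score."""
    q = normalize(query_vec)
    return max(index, key=lambda entry: dot(q, entry[1]))[0]


# Toy data: three mentions of unknown entities, grouped into two clusters.
mention_vecs = [[1.0, 0.1], [0.9, 0.3], [0.0, 1.0]]
cluster_ids = ["unk_0", "unk_0", "unk_1"]

entity_index = index_per_entity(mention_vecs, cluster_ids)
mention_index = index_per_mention(mention_vecs, cluster_ids)
predicted = link([0.8, 0.2], entity_index)
```

With entity-level indexing the index holds one entry per discovered cluster instead of one per mention, so a new mention is scored against a single pooled representation of each unknown entity.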