Modern Entity Linking (EL) systems entrench a popularity bias, yet there is no dataset focusing on tail and emerging entities in languages other than English. We present Hansel, a new benchmark in Chinese that fills the gap in non-English few-shot and zero-shot EL challenges. The test set of Hansel is human-annotated and reviewed, and was created with a novel method for collecting zero-shot EL datasets. It covers 10K diverse documents from news, social media posts, and other web articles, with Wikidata as its target Knowledge Base. We demonstrate that the existing state-of-the-art EL system performs poorly on Hansel (R@1 of 36.6% on Few-Shot). We then establish a strong baseline that scores an R@1 of 46.2% on Few-Shot and 76.6% on Zero-Shot on our dataset. We also show that our baseline achieves competitive results on the TAC-KBP2015 Chinese Entity Linking task.
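The abstract reports results as R@1. As a minimal sketch, assuming R@1 denotes recall at rank 1 (the fraction of mentions whose top-ranked candidate is the gold entity), the metric could be computed as below. The function name `recall_at_k`, its parameters, and the toy Wikidata-style QIDs are illustrative assumptions, not taken from the Hansel evaluation code, whose exact protocol is not described here.

```python
from typing import Sequence


def recall_at_k(
    gold_ids: Sequence[str],
    ranked_candidates: Sequence[Sequence[str]],
    k: int = 1,
) -> float:
    """Fraction of mentions whose gold entity ID is among the top-k candidates.

    gold_ids[i] is the gold entity for mention i; ranked_candidates[i] is the
    system's candidate list for that mention, best candidate first.
    """
    if len(gold_ids) != len(ranked_candidates):
        raise ValueError("gold_ids and ranked_candidates must have equal length")
    if not gold_ids:
        return 0.0
    hits = sum(
        1
        for gold, candidates in zip(gold_ids, ranked_candidates)
        if gold in candidates[:k]
    )
    return hits / len(gold_ids)


# Toy data (hypothetical QIDs): four mentions with ranked system outputs.
gold = ["Q1", "Q2", "Q3", "Q4"]
predictions = [["Q1", "Q9"], ["Q7", "Q2"], ["Q3"], []]

r_at_1 = recall_at_k(gold, predictions, k=1)  # mentions 1 and 3 are hits
r_at_2 = recall_at_k(gold, predictions, k=2)  # mention 2 also becomes a hit
```

Under this definition, a mention with no returned candidates counts as a miss, which is one reasonable convention among several.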