Candidate generation is a crucial module in entity linking. It also plays a key role in multiple NLP tasks that have been shown to benefit from knowledge bases. Nevertheless, it has often been overlooked in the monolingual English entity linking literature, as naive approaches obtain very good performance. Unfortunately, the existing approaches for English cannot be successfully transferred to poorly resourced languages. This paper presents an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking, with a focus on low-resource languages. Among other contributions, we point out limitations in the evaluation conducted in previous works. We introduce a characterization of queries into types based on their difficulty, which improves the interpretability of the performance of different methods. We also propose a lightweight and simple solution based on the construction of indexes whose design is motivated by more complex transfer-learning-based neural approaches. A thorough empirical analysis on nine real-world datasets under two evaluation settings shows that our simple solution outperforms the state-of-the-art approach in terms of both quality and efficiency for almost all datasets and query types.