Entity linking is an important problem with many applications. Most previous solutions were designed for settings where annotated training data is available, which is, however, not the case in numerous domains. We propose a light-weight and scalable entity linking method, Eigenthemes, that relies solely on the availability of entity names and a referent knowledge base. Eigenthemes exploits the fact that the entities that are truly mentioned in a document (the "gold entities") tend to form a semantically dense subset of the set of all candidate entities in the document. Geometrically speaking, when representing entities as vectors via some given embedding, the gold entities tend to lie in a low-rank subspace of the full embedding space. Eigenthemes identifies this subspace using the singular value decomposition and scores candidate entities according to their proximity to the subspace. On the empirical front, we introduce multiple strong baselines that compare favorably to the existing state of the art. Extensive experiments on benchmark datasets from a variety of real-world domains showcase the effectiveness of our approach.
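The geometric idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes candidate entity embeddings are stacked as rows of a matrix, that rows are normalized to unit length, and that "proximity to the subspace" is measured as the length of each row's projection onto the top-k right singular vectors. The function name `eigentheme_scores`, the parameter `k`, and the toy vectors are hypothetical choices made for illustration only.

```python
import numpy as np


def eigentheme_scores(candidates: np.ndarray, k: int = 1) -> np.ndarray:
    """Score each candidate (row) by its proximity to the top-k singular subspace.

    Assumption: rows are normalized first, and proximity is the norm of the
    projection onto the subspace, giving a value in [0, 1].
    """
    # Normalize rows so that vector length does not influence the score.
    norms = np.linalg.norm(candidates, axis=1, keepdims=True)
    unit = candidates / np.clip(norms, 1e-12, None)

    # Thin SVD: rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(unit, full_matrices=False)
    basis = vt[:k]  # (k, d) orthonormal basis of the low-rank subspace

    # Projection length of each unit vector onto the subspace.
    return np.linalg.norm(unit @ basis.T, axis=1)


# Toy data: four candidates that share a common direction (a semantically
# dense group) followed by two unrelated candidates.
toy_candidates = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [1.0, -0.1, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
toy_scores = eigentheme_scores(toy_candidates, k=1)
print(toy_scores)
```

In this toy setting the dense group dominates the leading singular direction, so its members score higher than the two unrelated candidates. How many singular vectors to keep, and whether candidates should be weighted before the decomposition, are design choices this sketch does not settle.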