Entity linking (EL) is the task of linking entity mentions in text to their corresponding entities in a knowledge base. EL features of entities (e.g., prior probability, relatedness score, and entity embedding) are usually estimated from Wikipedia. However, newly emerging entities (EEs) that have just been discovered in news may not yet be included in Wikipedia. As a consequence, the required EL features for those EEs cannot be obtained from Wikipedia, and EL models fail to link ambiguous mentions to those EEs correctly because their EL features are absent. To deal with this problem, in this paper we focus on a new task: learning EL features for emerging entities in a general way. We propose a novel approach called STAMO that learns high-quality EL features for EEs automatically. It needs only a small number of labeled documents for each EE, collected from the Web, because it further leverages the knowledge hidden in unlabeled data. STAMO is mainly based on self-training, which allows it to be flexibly integrated with any EL feature or EL model but also makes it susceptible to the error reinforcement problem caused by mislabeled data. Instead of following common self-training strategies that try to discard mislabeled data explicitly, we regard self-training as a multiple optimization process with respect to the EL features of EEs, and propose both intra-slot and inter-slot optimizations to alleviate the error reinforcement problem implicitly. We construct two EL datasets involving selected EEs to evaluate the quality of the obtained EL features, and the experimental results show that our approach significantly outperforms baseline methods for learning EL features.
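To make the notions of an EL feature and of error reinforcement concrete, the following is a minimal sketch of plain self-training applied to one feature, the prior probability p(entity | mention). It is not STAMO: the intra-slot and inter-slot optimizations are not reproduced here, and the function names, the confidence threshold, and the example mention and entity identifiers are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_priors(labeled):
    """Estimate p(entity | mention) from (mention, entity) pairs by counting."""
    counts = defaultdict(Counter)
    for mention, entity in labeled:
        counts[mention][entity] += 1
    return {
        mention: {e: c / sum(ents.values()) for e, c in ents.items()}
        for mention, ents in counts.items()
    }


def self_train_priors(labeled, unlabeled_mentions, rounds=3, threshold=0.6):
    """Plain self-training (hypothetical helper, not the paper's method).

    Each round pseudo-labels every unlabeled mention whose best candidate
    has prior >= threshold, adds it to the labeled set, and re-estimates.
    """
    labeled = list(labeled)
    pool = list(unlabeled_mentions)
    for _ in range(rounds):
        priors = estimate_priors(labeled)
        remaining = []
        for mention in pool:
            candidates = priors.get(mention, {})
            if candidates:
                best, p = max(candidates.items(), key=lambda kv: kv[1])
                if p >= threshold:
                    labeled.append((mention, best))  # pseudo-label
                    continue
            remaining.append(mention)
        if len(remaining) == len(pool):  # nothing new was labeled
            break
        pool = remaining
    return estimate_priors(labeled)


# Hypothetical seed data: three labeled mentions, three unlabeled ones.
seed = [
    ("Alexa", "Amazon_Alexa"),
    ("Alexa", "Amazon_Alexa"),
    ("Alexa", "Alexa_Internet"),
]
priors = self_train_priors(seed, ["Alexa", "Alexa", "Alexa"])
```

In this toy run the prior for `Amazon_Alexa` moves from 2/3 to 5/6 although no new evidence was observed: every pseudo-label simply copies the current majority. If some of those unlabeled mentions actually referred to the other entity, the mistake would be baked into the feature, which is the error reinforcement problem described above.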