An entity mention in text, such as "Washington," may correspond to many different named entities, such as the city "Washington D.C." or the newspaper "Washington Post." The goal of named entity disambiguation is to identify the mentioned entity correctly among all possible candidates. If the type of a mentioned entity (e.g. location or person) can be predicted correctly from the context, the chance of selecting the right candidate may increase, because candidates of unlikely types can be assigned low probability. This paper proposes cluster-based mention typing for named entity disambiguation. The aim of mention typing is to predict the type of a given mention based on its context. Generally, manually curated type taxonomies such as Wikipedia categories are used. We instead introduce cluster-based mention typing, where named entities are clustered based on their contextual similarities and the cluster ids are assigned as types. The hyperlinked mentions in Wikipedia and their contexts are used to obtain these cluster-based types. Mention typing models are then trained on these mentions, which are labeled with their cluster-based types through distant supervision. In the named entity disambiguation phase, the cluster-based types of a given mention are predicted first, and these types are then used as features in a ranking model to select the best entity among the candidates. We represent entities at multiple contextual levels and obtain a different clustering (and thus a different typing model) for each level. As each clustering partitions the entity space differently, mention typing based on each clustering discriminates the mention differently. When predictions from all typing models are used together, our system achieves results that are better than or comparable to the state of the art, as assessed by randomization tests, on four de facto standard test sets.
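The pipeline described above (cluster entities, use cluster ids as types, predict the type of a mention, rank candidates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the two-dimensional entity vectors, the prior values, the plain k-means routine, the nearest-centroid stand-in for a trained typing model, and the product of prior and type probability used in place of a learned ranking model.

```python
import math

# Hypothetical 2-d context vectors for entities (illustrative values only).
ENTITY_VECS = {
    "Washington,_D.C.": (0.9, 0.1),
    "Seattle": (0.8, 0.2),
    "The_Washington_Post": (0.1, 0.9),
    "The_New_York_Times": (0.2, 0.8),
}

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vecs):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0])))

def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: dist2(vec, centroids[c]))

def kmeans(vectors, init_centroids, iters=10):
    """Plain k-means with fixed initial centroids, so the result is deterministic."""
    centroids = list(init_centroids)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids

# Step 1: cluster entities; the cluster id serves as the entity's type.
CENTROIDS = kmeans(
    list(ENTITY_VECS.values()),
    [ENTITY_VECS["Washington,_D.C."], ENTITY_VECS["The_Washington_Post"]],
)
ENTITY_TYPE = {name: nearest(vec, CENTROIDS) for name, vec in ENTITY_VECS.items()}

def type_probs(mention_vec):
    """Stand-in typing model: softmax over negative squared distances to centroids."""
    scores = [math.exp(-dist2(mention_vec, c)) for c in CENTROIDS]
    total = sum(scores)
    return [s / total for s in scores]

def disambiguate(mention_vec, priors):
    """Step 2: rank candidates by prior times the predicted probability of their type."""
    probs = type_probs(mention_vec)
    scores = {e: p * probs[ENTITY_TYPE[e]] for e, p in priors.items()}
    return max(scores, key=scores.get)

# Hypothetical candidate priors for the mention "Washington".
PRIORS = {"Washington,_D.C.": 0.6, "The_Washington_Post": 0.4}

print(disambiguate((0.2, 0.7), PRIORS))  # newspaper-like context -> The_Washington_Post
print(disambiguate((0.8, 0.1), PRIORS))  # location-like context  -> Washington,_D.C.
```

In the first call the prior alone would favor the city, but the predicted cluster-based type shifts the decision to the newspaper, which is the effect the abstract attributes to mention typing. Running several such clusterings over different entity representations and combining their predictions would correspond to the multi-level setup, which this sketch does not cover.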