Corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, given a corpus and a tiny set of seeds) is a critical task in knowledge discovery. It can facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search. To discover new entities for an expanded set, previous approaches either perform a one-time entity ranking based on distributional similarity or resort to iterative pattern-based bootstrapping. The core challenge for these methods is how to deal with noisy context features derived from free-text corpora, which may lead to entity intrusion and semantic drift. In this study, we propose a novel framework, SetExpan, which tackles this problem with two techniques: (1) a context feature selection method that selects clean context features for calculating entity-entity distributional similarity, and (2) a ranking-based unsupervised ensemble method for expanding the entity set based on the denoised context features. Experiments on three datasets show that SetExpan is robust and outperforms previous state-of-the-art methods in terms of mean average precision.
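The two techniques can be illustrated with a minimal sketch. The abstract does not specify the scoring function, the similarity measure, or the aggregation rule, so everything below is an assumption made for illustration: features are scored by their summed weight over the seeds, similarity is a weighted Jaccard restricted to the selected features, and the ensemble aggregates rankings by mean reciprocal rank. The function names (`select_features`, `weighted_jaccard`, `expand`) and all parameters are hypothetical, not SetExpan's actual interface.

```python
import random
from collections import defaultdict


def select_features(weights, seeds, top_q):
    """Keep the top_q context features with the highest total weight over the seeds.

    `weights` maps entity -> {context feature -> weight} (assumed input format).
    """
    score = defaultdict(float)
    for seed in seeds:
        for feat, w in weights.get(seed, {}).items():
            score[feat] += w
    # Sort by descending score; break ties by feature name for determinism.
    return sorted(score, key=lambda f: (-score[f], f))[:top_q]


def weighted_jaccard(a, b, feats):
    """Weighted Jaccard similarity of two entities, restricted to `feats`."""
    num = sum(min(a.get(f, 0.0), b.get(f, 0.0)) for f in feats)
    den = sum(max(a.get(f, 0.0), b.get(f, 0.0)) for f in feats)
    return num / den if den else 0.0


def expand(weights, seeds, top_q=10, n_rounds=20, sample_frac=0.75,
           top_k=5, rng_seed=0):
    """Rank non-seed entities by an ensemble over random feature subsets."""
    rng = random.Random(rng_seed)
    feats = select_features(weights, seeds, top_q)
    candidates = sorted(e for e in weights if e not in seeds)
    if not feats or not candidates:
        return []
    sample_size = max(1, int(len(feats) * sample_frac))
    mrr = defaultdict(float)
    for _ in range(n_rounds):
        # Each round ranks candidates using a random subset of clean features.
        subset = rng.sample(feats, sample_size)
        sim = {
            c: sum(weighted_jaccard(weights[c], weights.get(s, {}), subset)
                   for s in seeds) / len(seeds)
            for c in candidates
        }
        ranking = sorted(candidates, key=lambda c: (-sim[c], c))
        # Accumulate reciprocal ranks across rounds.
        for rank, c in enumerate(ranking, start=1):
            mrr[c] += 1.0 / rank
    return sorted(candidates, key=lambda c: (-mrr[c], c))[:top_k]


# Toy corpus statistics (made up for illustration only).
toy = {
    "Texas": {"state of _": 3.0, "_ governor": 2.0, "flew to _": 1.0},
    "Ohio": {"state of _": 2.0, "_ governor": 2.0},
    "Florida": {"state of _": 3.0, "_ governor": 1.0, "flew to _": 1.0},
    "Nevada": {"state of _": 1.0, "_ governor": 1.0},
    "Paris": {"flew to _": 3.0, "mayor of _": 2.0},
}
print(expand(toy, ["Texas", "Ohio"], top_q=2, top_k=2))
```

In this toy setup, the generic feature "flew to _" is dropped during feature selection, so an entity such as "Paris" that shares only that noisy context with the seeds cannot intrude into the expanded set. Averaging reciprocal ranks over many feature subsets is one simple way to keep any single feature from dominating the final ranking.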