Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where such dictionaries do not exist. While a recent study used a phrase retrieval model to automatically construct pseudo-dictionaries from entities retrieved from Wikipedia, these dictionaries often have limited coverage because the retriever tends to return popular entities rather than rare ones. In this study, we present a phrase embedding search that efficiently creates high-coverage dictionaries. Specifically, reformulating natural language queries into phrase representations allows the retriever to search a space densely populated with diverse entities. In addition, we present HighGEN, a novel framework that generates NER datasets using the high-coverage dictionaries obtained through the phrase embedding search. To reduce the noise inherent in high-coverage dictionaries, HighGEN generates weak labels based on the distance between the embeddings of a candidate phrase and the target entity type. We compare HighGEN with current weakly supervised NER models on six NER benchmarks and demonstrate the superiority of our models.
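The following is a minimal sketch, not the authors' implementation, of the two ideas summarized above: (1) building a high-coverage dictionary by nearest-neighbor search over phrase embeddings for a type query, and (2) assigning weak labels from the distance between a candidate phrase embedding and the entity-type embedding. The encoder, the specific model name, the similarity threshold, and the toy phrases are assumptions for illustration only.

```python
# Sketch only: phrase-embedding retrieval and distance-based weak labeling,
# under assumed components (sentence-transformers encoder, cosine similarity).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model

# Candidate phrases mined from a corpus (toy examples).
phrases = ["Lionel Messi", "aspirin", "Mount Kilimanjaro", "acetaminophen"]
phrase_emb = encoder.encode(phrases, normalize_embeddings=True)

# Natural-language queries reformulated as entity-type descriptions.
type_queries = {
    "PERSON": "a person's name",
    "DRUG": "the name of a medication or drug",
}
type_emb = {t: encoder.encode(q, normalize_embeddings=True)
            for t, q in type_queries.items()}

def build_dictionary(entity_type, k=2):
    """High-coverage dictionary: phrases closest to the type query embedding."""
    sims = phrase_emb @ type_emb[entity_type]  # cosine similarity (normalized)
    top = np.argsort(-sims)[:k]
    return [phrases[i] for i in top]

def weak_label(candidate, entity_type, threshold=0.3):
    """Distance-based weak labeling: keep the label only if the candidate is
    close enough to the type embedding, suppressing noisy dictionary matches."""
    emb = encoder.encode(candidate, normalize_embeddings=True)
    return entity_type if float(emb @ type_emb[entity_type]) >= threshold else "O"

print(build_dictionary("DRUG"))
print(weak_label("aspirin", "DRUG"))
```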