Named entity disambiguation (NED), which involves mapping textual mentions to structured entities, is particularly challenging in the medical domain due to the presence of rare entities. Existing approaches are limited by the presence of coarse-grained structural resources in biomedical knowledge bases as well as the use of training datasets that provide low coverage over uncommon resources. In this work, we address these issues by proposing a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain. We utilize our integration scheme to augment structural resources and generate a large biomedical NED dataset for pretraining. Our pretrained model with injected structural knowledge achieves state-of-the-art performance on two benchmark medical NED datasets: MedMentions and BC5CDR. Furthermore, we improve disambiguation of rare entities by up to 57 accuracy points.