Medical entity linking is the task of identifying and standardizing medical concepts mentioned in unstructured text. Most existing methods adopt a three-step approach: (1) detecting mentions, (2) generating a list of candidate concepts, and (3) picking the best concept among them. In this paper, we address the overgeneration of candidate concepts in the candidate generation module, the most under-studied component of medical entity linking. To this end, we present MedType, a fully modular system that prunes irrelevant candidate concepts based on the predicted semantic type of an entity mention. We incorporate MedType into five off-the-shelf toolkits for medical entity linking and demonstrate that it consistently improves entity linking performance across several benchmark datasets. To address the dearth of annotated training data for medical entity linking, we present WikiMed and PubMedDS, two large-scale medical entity linking datasets, and demonstrate that pre-training MedType on these datasets further improves entity linking performance. We make our source code and datasets publicly available for medical entity linking research.
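The type-based pruning idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` structure, the function names, the placeholder concept identifiers, and the fallback to the unpruned list when no candidate matches are all assumptions made for illustration. The predicted semantic types are taken as given input here, standing in for the output of a type predictor such as MedType.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate concept produced by step (2). Hypothetical structure."""
    concept_id: str            # placeholder identifier, not a real ontology ID
    semantic_types: frozenset  # semantic types attached to the concept
    score: float               # the base linker's own confidence


def prune_by_type(candidates, predicted_types):
    """Keep only candidates that share at least one type with the prediction."""
    kept = [c for c in candidates if c.semantic_types & predicted_types]
    # Assumption: if the filter removes everything, fall back to the full list
    # so that a wrong type prediction does not leave the mention unlinked.
    return kept or list(candidates)


def link(candidates, predicted_types):
    """Step (3): return the highest-scoring candidate that survives pruning."""
    pruned = prune_by_type(candidates, predicted_types)
    return max(pruned, key=lambda c: c.score, default=None)


# Toy example: the mention "cold" in "the patient has a cold".
candidates = [
    Candidate("C_COLD_TEMPERATURE", frozenset({"Phenomenon"}), 0.6),
    Candidate("C_COMMON_COLD", frozenset({"Disease"}), 0.5),
]
best = link(candidates, frozenset({"Disease"}))
print(best.concept_id)  # prints C_COMMON_COLD
```

Without pruning, the base linker's scores alone would select the temperature sense; filtering by the predicted type removes that candidate before the final ranking step. Because the pruning step only consumes and returns a candidate list, it can sit between candidate generation and ranking in an existing toolkit without changing either.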