The UMLS Metathesaurus integrates more than 200 biomedical source vocabularies. During the Metathesaurus construction process, synonymous terms are clustered into concepts by human editors, assisted by lexical similarity algorithms. This process is error-prone and time-consuming. Recently, a deep learning model (LexLM) has been developed for the UMLS Vocabulary Alignment (UVA) task. This work introduces UBERT, a BERT-based language model pretrained on UMLS terms via a supervised Synonymy Prediction (SP) task that replaces the original Next Sentence Prediction (NSP) task. The effectiveness of UBERT for the UMLS Metathesaurus construction process is evaluated using the UVA task. We show that UBERT outperforms LexLM as well as biomedical BERT-based models. Key to the performance of UBERT are the synonymy prediction task specifically developed for UBERT, the tight alignment of training data to the UVA task, and the similarity of the models used for pretraining and fine-tuning UBERT.
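To make the SP pretraining setup concrete, the following is a minimal sketch of how synonymy-prediction training pairs could be derived from concept clusters. It assumes a toy mapping from concept identifiers (CUIs) to synonymous terms; the concept data, function name, and pairing strategy are illustrative, not the paper's actual pipeline or real UMLS content.

```python
# Hypothetical sketch: building Synonymy Prediction (SP) pairs as a binary
# classification dataset over term pairs. Label 1 means the two terms are
# clustered under the same UMLS concept (CUI); label 0 means they are not.
from itertools import combinations

# Toy CUI -> synonymous-terms mapping (illustrative, not real UMLS data).
concepts = {
    "C0000001": ["myocardial infarction", "heart attack"],
    "C0000002": ["hypertension", "high blood pressure"],
}

def make_sp_pairs(concepts):
    """Return (term_a, term_b, label) triples for SP pretraining."""
    pairs = []
    # Positive pairs: two terms from the same concept cluster.
    for terms in concepts.values():
        for a, b in combinations(terms, 2):
            pairs.append((a, b, 1))
    # Negative pairs: terms drawn from two different concepts.
    for c1, c2 in combinations(sorted(concepts), 2):
        for a in concepts[c1]:
            for b in concepts[c2]:
                pairs.append((a, b, 0))
    return pairs

pairs = make_sp_pairs(concepts)
# Each pair would then be encoded as "[CLS] term_a [SEP] term_b [SEP]"
# and the SP head trained to predict the label, in place of NSP.
```

In practice negatives would be sampled rather than enumerated exhaustively, since all cross-concept pairs grow quadratically with the number of concepts.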