Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM-based models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.
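The router idea above can be sketched as follows. This is a minimal, hypothetical illustration (the function names, features, and threshold-search strategy are assumptions, not the paper's actual implementation): given per-example confidence scores from an LM-based ranker and a KG-embedding (KGE) ranker, a router learns from validation data which model to trust for each query.

```python
# Minimal router sketch (hypothetical): learn a confidence-gap threshold on
# validation data, then dispatch each query to the LM or KGE model.

def train_router(val_examples):
    """val_examples: list of (lm_conf, kge_conf, lm_correct, kge_correct).
    Picks the threshold on (lm_conf - kge_conf) that maximizes how often
    the routed model is the one that answered correctly."""
    best_thr, best_acc = 0.0, -1.0
    candidates = sorted({lm - kge for lm, kge, _, _ in val_examples})
    for thr in candidates:
        correct = 0
        for lm, kge, lm_ok, kge_ok in val_examples:
            pick_lm = (lm - kge) >= thr
            correct += lm_ok if pick_lm else kge_ok
        acc = correct / len(val_examples)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

def route(lm_conf, kge_conf, threshold):
    """Return which model's prediction to use for this query."""
    return "lm" if (lm_conf - kge_conf) >= threshold else "kge"
```

In practice such a router can use richer input features (e.g., entity degree or description length) and a learned classifier rather than a single threshold; the sketch only conveys the dispatch-per-example structure.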