Biomedical research is growing at such an exponential pace that scientists, researchers, and practitioners are no longer able to cope with the amount of literature published in the domain. The knowledge presented in the literature needs to be systematized in such a way that claims and hypotheses can be easily found, accessed, and validated. Knowledge graphs can provide such a framework for semantic knowledge representation from literature. However, in order to build a knowledge graph, it is necessary to extract knowledge in the form of relationships between biomedical entities and to normalize both the entities and the relationship types. In this paper, we present and compare several rule-based and machine learning-based methods (Naive Bayes and Random Forests as examples of traditional machine learning, and a T5-based model as an example of modern deep learning) for scalable relationship extraction from biomedical literature, intended for integration into knowledge graphs. We examine how resilient these methods are to unbalanced and fairly small datasets, showing that the T5-based model copes well both with small datasets, owing to its pre-training on the large C4 corpus, and with unbalanced data. The best-performing model was the T5 model fine-tuned on balanced data, with a reported F1-score of 0.88.
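As a rough illustration of the deep learning approach mentioned above, the sketch below shows how relationship extraction between two biomedical entities can be framed as a text-to-text task with a T5 model via the Hugging Face transformers library. This is not the authors' code: the checkpoint name, prompt format, and example relation label are assumptions, and a usable relation label would only be produced after fine-tuning on a labeled relation dataset.

```python
# Minimal sketch (not the paper's implementation) of relation extraction
# framed as text-to-text generation with T5.
# Assumptions: "t5-base" checkpoint, a hypothetical prompt format, and an
# illustrative output label; fine-tuning is required before the model emits
# normalized relation types.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical input: a sentence plus the two entities whose relation we want.
sentence = ("Aspirin inhibits platelet aggregation by acetylating "
            "cyclooxygenase-1.")
prompt = (f"extract relation: {sentence} "
          f"entity1: Aspirin entity2: cyclooxygenase-1")

inputs = tokenizer(prompt, return_tensors="pt")
# After fine-tuning, generation would yield a short relation label
# (e.g. "inhibits"); the raw pre-trained checkpoint will not.
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```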