Biomedical knowledge graphs (BioMedKGs) are essential infrastructure for biomedical and healthcare big data and artificial intelligence (AI), facilitating natural language processing, model development, and data exchange. For decades, these knowledge graphs have been built through expert curation; however, this approach can no longer keep pace with today's AI development, and a transition to algorithmically generated BioMedKGs is necessary. In this work, we introduce the Biomedical Informatics Ontology System (BIOS), the first large-scale, publicly available BioMedKG generated entirely by machine learning algorithms. BIOS currently contains 4.1 million concepts, 7.4 million terms in two languages, and 7.3 million relation triplets. We present the methodology for developing BIOS, including the curation of raw biomedical terms, the computational identification of synonymous terms and their aggregation into concept nodes, semantic type classification of the concepts, relation identification, and biomedical machine translation. We provide statistics on the current BIOS content and perform preliminary assessments of term quality, synonym grouping, and relation extraction. The results suggest that machine learning-based BioMedKG development is a viable alternative to traditional expert curation.
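To make the step of aggregating synonymous terms into concept nodes concrete, the following is a minimal sketch, not the BIOS implementation. It assumes that an upstream model has already produced pairwise synonym decisions, and it groups terms by taking connected components with a union-find structure. The function name `group_synonyms` and the example terms are hypothetical, and the abstract does not state that BIOS merges terms in this way.

```python
from collections import defaultdict


def group_synonyms(terms, synonym_pairs):
    """Group terms into concepts as connected components of the synonym graph.

    Assumption: every term appearing in synonym_pairs is also listed in terms.
    """
    parent = {t: t for t in terms}

    def find(t):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in synonym_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for t in terms:
        groups[find(t)].add(t)
    return list(groups.values())


# Illustrative input only; these pairs stand in for model predictions.
terms = [
    "myocardial infarction", "heart attack", "MI",
    "aspirin", "acetylsalicylic acid", "fever",
]
pairs = [
    ("myocardial infarction", "heart attack"),
    ("heart attack", "MI"),
    ("aspirin", "acetylsalicylic acid"),
]
concepts = group_synonyms(terms, pairs)
```

Each returned set corresponds to one candidate concept node, and a term with no predicted synonym forms its own concept. Because the merging is transitive, a single false-positive pair can fuse two unrelated concepts, which is one reason synonym grouping quality needs to be assessed separately.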