Drug discovery and development is a complex and costly process. Machine learning approaches are being investigated to help improve the effectiveness and speed of multiple stages of the drug discovery pipeline. Of these, approaches that use Knowledge Graphs (KGs) show promise in many tasks, including drug repurposing, drug toxicity prediction and target gene-disease prioritisation. In a drug discovery KG, crucial elements including genes, diseases and drugs are represented as entities, whilst the relationships between them indicate an interaction. However, constructing high-quality KGs requires suitable data. In this review, we detail publicly available sources suitable for use in constructing drug discovery-focused KGs. We aim to help guide machine learning and KG practitioners who are interested in applying new techniques to the drug discovery field, but who may be unfamiliar with the relevant data sources. The datasets are selected via strict criteria, categorised according to the primary type of information they contain, and considered in terms of what information could be extracted from them to build a KG. We then present a comparative analysis of existing public drug discovery KGs and an evaluation of selected motivating case studies from the literature. Additionally, we raise numerous and unique challenges and issues associated with the domain and its datasets, whilst also highlighting key future research directions. We hope this review will motivate the use of KGs in solving key and emerging questions in the drug discovery domain.
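As a minimal sketch of the entity-and-relationship representation described above, a KG can be held as a list of (head, relation, tail) triples. The entity names (`DrugA`, `GeneX`, `DiseaseY`) and relation names (`targets`, `associated_with`) below are hypothetical placeholders chosen for illustration; they are not drawn from any dataset covered by the review.

```python
from collections import defaultdict

# Hypothetical placeholder entities and relation names, for illustration only.
triples = [
    ("DrugA", "targets", "GeneX"),
    ("GeneX", "associated_with", "DiseaseY"),
    ("DrugB", "targets", "GeneX"),
]


def build_adjacency(triples):
    """Index triples by head entity: head -> list of (relation, tail)."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    return adjacency


def drugs_linked_to_disease(triples, disease):
    """Return drugs that target a gene associated with the given disease.

    This follows a two-hop path: drug -targets-> gene -associated_with-> disease.
    """
    genes = {h for h, r, t in triples if r == "associated_with" and t == disease}
    return sorted({h for h, r, t in triples if r == "targets" and t in genes})


adjacency = build_adjacency(triples)
candidates = drugs_linked_to_disease(triples, "DiseaseY")
```

A two-hop lookup of this kind is only a toy stand-in for tasks such as drug repurposing; in practice such tasks are usually framed as link prediction over a much larger graph.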