The drug discovery and development process is long and expensive, costing over 1 billion USD per drug on average and taking 10-15 years. To reduce the high attrition throughout the process, interest has grown over the past decade in applying machine learning to various stages of drug discovery and development, especially the earliest stage: the identification of druggable disease genes. In this paper, we develop a new tensor factorisation model to predict potential drug targets (genes or proteins) for treating diseases. We created a three-dimensional data tensor consisting of 1,048 gene targets, 860 diseases and 230,011 evidence attributes and clinical outcomes connecting them, using data extracted from the Open Targets and PharmaProjects databases. We enriched the data with gene target representations learned from a drug-discovery-oriented knowledge graph and applied our proposed method to predict the clinical outcomes for unseen gene target and disease pairs. We designed three evaluation strategies to measure prediction performance and benchmarked several commonly used machine learning classifiers together with Bayesian matrix and tensor factorisation methods. The results show that incorporating knowledge graph embeddings significantly improves prediction accuracy and that training tensor factorisation alongside a dense neural network outperforms all baselines. In summary, our framework combines two actively studied machine learning approaches to disease target identification, namely tensor factorisation and knowledge graph representation learning, and could be a promising avenue for further exploration in data-driven drug discovery.
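To make the idea of factorising a gene × disease × attribute tensor concrete, the following is a minimal sketch of a plain CP (CANDECOMP/PARAFAC) decomposition fitted by alternating least squares. It is an illustration only, not the model proposed in the paper: it assumes a tiny, fully observed synthetic tensor, omits the Bayesian treatment, the knowledge graph embeddings and the dense neural network, and the function names `reconstruct` and `cp_als` are hypothetical.

```python
import numpy as np


def reconstruct(G, D, A):
    """Rebuild the tensor: X_hat[i, j, k] = sum_r G[i, r] * D[j, r] * A[k, r]."""
    return np.einsum("ir,jr,kr->ijk", G, D, A)


def cp_als(X, rank, n_iter=50, seed=0):
    """Fit a rank-`rank` CP model to a fully observed 3-way tensor X.

    Returns gene, disease and attribute factor matrices plus the squared
    reconstruction error after each sweep (assumed setup, not the paper's).
    """
    rng = np.random.default_rng(seed)
    G, D, A = (rng.standard_normal((n, rank)) for n in X.shape)
    losses = []
    for _ in range(n_iter):
        # Each update is the least-squares solution for one factor matrix
        # with the other two held fixed.
        G = np.einsum("ijk,jr,kr->ir", X, D, A) @ np.linalg.pinv((D.T @ D) * (A.T @ A))
        D = np.einsum("ijk,ir,kr->jr", X, G, A) @ np.linalg.pinv((G.T @ G) * (A.T @ A))
        A = np.einsum("ijk,ir,jr->kr", X, G, D) @ np.linalg.pinv((G.T @ G) * (D.T @ D))
        losses.append(float(np.sum((X - reconstruct(G, D, A)) ** 2)))
    return G, D, A, losses


# Toy data: 6 genes, 5 diseases, 4 attributes, built from planted rank-2 factors.
rng = np.random.default_rng(1)
planted = [rng.random((n, 2)) for n in (6, 5, 4)]
X = reconstruct(*planted)

G, D, A, losses = cp_als(X, rank=2)

# Model score for gene 0, disease 1, attribute 2 (for example, one clinical outcome).
score = reconstruct(G, D, A)[0, 1, 2]
```

In this sketch each gene, disease and attribute receives a low-dimensional vector, and the score for a triple is the sum over components of the product of the three vectors' entries. Predicting outcomes for unseen gene-disease pairs would additionally require fitting on observed entries only, which this fully observed example does not do.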