Improving on the standard of care for diseases is predicated on better treatments, which in turn rely on finding and developing new drugs. However, drug discovery is a complex and costly process. The adoption of methods from machine learning has given rise to drug discovery knowledge graphs, which exploit the inherently interconnected nature of the domain. Graph-based data modeling, combined with knowledge graph embeddings, provides a more intuitive representation of the domain and is suitable for inference tasks such as predicting missing links. One example is producing ranked lists of genes likely to be associated with a given disease, often referred to as target discovery. It is therefore critical that these predictions are not only pertinent but also biologically meaningful. However, knowledge graphs can be biased, either directly through the underlying data sources that are integrated or through modeling choices made when constructing the graph. One consequence is that certain entities can become topologically overrepresented. We show how knowledge graph embedding models can be affected by this structural imbalance, resulting in densely connected entities being ranked highly regardless of context. We provide support for this observation across different datasets, models and predictive tasks. Further, we show how the graph topology can be perturbed to artificially alter the rank of a gene by adding random, biologically meaningless information. This suggests that such models can be influenced more by the frequency of entities than by the biological information encoded in the relations, which creates issues when entity frequency is not a true reflection of the underlying data. Our results highlight the importance of data modeling choices and emphasize the need for practitioners to be mindful of these issues when interpreting model outputs and when composing knowledge graphs.
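As a rough illustration of the kind of check the abstract implies, the sketch below counts how often each entity appears in a set of triples and compares that degree with the position a model assigns to each gene. It is a minimal sketch on invented toy data, not the authors' method: the triples, the `predicted_rank` values and all function names are hypothetical, and Spearman correlation is used here only as one plausible way to quantify the relationship.

```python
from collections import Counter


def entity_degree(triples):
    """Count how many triples each entity appears in, as head or tail."""
    degree = Counter()
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    return degree


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy graph: GeneA is the most densely connected entity.
triples = [
    ("GeneA", "interacts", "GeneB"),
    ("GeneA", "interacts", "GeneC"),
    ("GeneA", "interacts", "GeneD"),
    ("GeneA", "associated_with", "DiseaseX"),
    ("GeneB", "interacts", "GeneC"),
    ("GeneB", "associated_with", "DiseaseY"),
    ("GeneC", "associated_with", "DiseaseY"),
]

# Hypothetical model output for some disease query (1 = top prediction).
predicted_rank = {"GeneA": 1, "GeneB": 2, "GeneC": 3, "GeneD": 4}

degree = entity_degree(triples)
genes = sorted(predicted_rank)
rho = spearman([degree[g] for g in genes], [predicted_rank[g] for g in genes])
print(f"Spearman correlation between degree and predicted rank: {rho:.3f}")
```

A strongly negative correlation means that higher-degree genes receive better (numerically lower) ranks. If the same pattern holds across unrelated disease queries, that would be consistent with the frequency-driven behavior described above.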