Recent advances in machine learning have created powerful new tools for tackling life science problems. This paper discusses the potential advantages of gene recommendation performed by artificial intelligence (AI). Gene recommendation engines address the following problem: if a user is interested in a set of genes, which other genes are likely to be related to that starting set and should be investigated? This task was solved with a custom deep learning recommendation engine, DeepProphet2 (DP2), which is freely available to researchers worldwide via https://www.generecommender.com?utm_source=DeepProphet2_paper&utm_medium=pdf. The insights behind the algorithm and its practical applications are illustrated below. The gene recommendation problem can be addressed by mapping genes to a metric space in which a distance can be defined to represent the real semantic distance between them. To achieve this objective, a transformer-based model was trained on PubMed, a well-curated and freely available corpus of papers. The paper describes the optimization procedures employed to obtain the best bias-variance trade-off, focusing on embedding size and network depth. The model's ability to discover sets of genes implicated in diseases and pathways was then assessed through cross-validation. A simple assumption guided the procedure: the network had no direct knowledge of pathways or diseases, but learned the similarities and interactions among genes. To further investigate the space in which the neural network represents genes, the dimensionality of the embedding was reduced and the results were projected onto a human-comprehensible space. Finally, a set of use cases illustrates the algorithm's potential applications in a real-world setting.
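The idea of recommending genes by distance in a metric space can be illustrated with a minimal sketch. It is not the DP2 implementation: the three-dimensional vectors below are invented for illustration (they are not model outputs), Euclidean distance to the centroid of the query set is an assumed ranking rule, and the names `EMBEDDINGS`, `recommend`, and `held_out_recovery_rate` are hypothetical. The last function mimics, in a very simplified leave-one-out form, the kind of check described above: hide one gene from a known set and see whether it is recommended back.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# They are NOT DeepProphet2 outputs and carry no biological meaning.
EMBEDDINGS = {
    "BRCA1": [0.9, 0.1, 0.0],
    "BRCA2": [0.8, 0.2, 0.1],
    "RAD51": [0.7, 0.3, 0.0],
    "TP53": [0.5, 0.5, 0.2],
    "INS": [0.0, 0.1, 0.9],
    "GCK": [0.1, 0.0, 0.8],
}


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def recommend(query_genes, embeddings, top_k=3):
    """Rank genes outside the query set by Euclidean distance to its centroid.

    Assumption: closeness to the centroid stands in for "relatedness".
    """
    query = set(query_genes)
    center = centroid([embeddings[g] for g in query])
    candidates = [g for g in embeddings if g not in query]
    candidates.sort(key=lambda g: math.dist(embeddings[g], center))
    return candidates[:top_k]


def held_out_recovery_rate(gene_set, embeddings, top_k=1):
    """Leave-one-out check: fraction of genes recovered when hidden from the set."""
    hits = 0
    for held_out in gene_set:
        query = [g for g in gene_set if g != held_out]
        if held_out in recommend(query, embeddings, top_k=top_k):
            hits += 1
    return hits / len(gene_set)


if __name__ == "__main__":
    print(recommend(["BRCA1", "BRCA2"], EMBEDDINGS, top_k=2))
    print(held_out_recovery_rate(["BRCA1", "BRCA2", "RAD51"], EMBEDDINGS))
```

In this toy setting the query genes are excluded from the candidates, so a recommendation is always a gene the user did not already supply. A trained model would provide the embeddings, and the distance and ranking rule could differ from the ones assumed here.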