Recent advances in machine learning have produced powerful new tools for tackling life science problems. This paper discusses the potential advantages of gene recommendation performed by artificial intelligence (AI). Gene recommendation engines address the following problem: given a set of genes that a user is interested in, which other genes are likely to be related to that set and should be investigated? This task was solved with a custom deep learning recommendation engine, DeepProphet2 (DP2), which is freely available to researchers worldwide via www.generecommender.com. The paper illustrates the insights behind the algorithm and its practical applications. The gene recommendation problem can be addressed by mapping genes to a metric space in which a distance can be defined that reflects the semantic distance between them. To achieve this objective, a transformer-based model was trained on PubMed, a well-curated, freely available corpus of papers. The paper describes the multiple optimization procedures that were employed to obtain the best bias-variance trade-off, focusing on embedding size and network depth. In this context, the model's ability to discover sets of genes implicated in diseases and pathways was assessed through cross-validation. A simple assumption guided the procedure: the network had no direct knowledge of pathways and diseases, but learned the similarities between genes and the interactions among them. Moreover, to further investigate the space in which the neural network represents genes, the dimensionality of the embedding was reduced and the results were projected onto a human-comprehensible space. Finally, a set of use cases illustrates the algorithm's potential applications in a real-world setting.
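The idea of recommending genes by their distance in an embedding space can be made concrete with a small sketch. This is not the DP2 implementation. The gene names and 3-dimensional vectors are invented placeholders, and both the cosine similarity and the averaging of the query genes into a centroid are assumptions chosen for illustration, since the abstract does not say which distance or aggregation DP2 uses.

```python
import numpy as np


def recommend(query, embeddings, k=2):
    """Rank genes outside `query` by cosine similarity to the query centroid.

    Hypothetical helper: the metric and the centroid aggregation are
    assumptions, not details taken from DP2.
    """
    names = list(embeddings)
    matrix = np.array([embeddings[n] for n in names], dtype=float)
    # Normalize each row so that a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    index = {n: i for i, n in enumerate(names)}

    # Summarize the starting set as the normalized mean of its vectors.
    centroid = matrix[[index[g] for g in query]].mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    scores = matrix @ centroid
    candidates = [n for n in names if n not in query]
    ranked = sorted(candidates, key=lambda n: scores[index[n]], reverse=True)
    return ranked[:k]


# Toy embeddings, invented for illustration only (not real gene data).
TOY_EMBEDDINGS = {
    "GENE_A": [1.0, 0.1, 0.0],
    "GENE_B": [0.9, 0.2, 0.1],
    "GENE_C": [0.8, 0.3, 0.0],
    "GENE_D": [0.0, 1.0, 0.2],
    "GENE_E": [0.0, 0.1, 1.0],
}

print(recommend(["GENE_A", "GENE_B"], TOY_EMBEDDINGS, k=2))
```

In this toy setup `GENE_C` ranks first because its vector points in nearly the same direction as the two query genes. In a trained model the vectors would come from the network's learned embedding rather than being written by hand.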