The chemical space of terpenes: insights from data science and AI

Terpenes are a widespread class of natural products with significant chemical and biological diversity, and many of these molecules have already made their way into medicines. Given the thousands of molecules already described, fully characterizing this chemical space is a challenging task when relying on classical approaches. In this work we employ a data science-based approach to systematically identify, compile and characterize the diversity of currently known terpenes. We worked with a natural product database, COCONUT, from which we extracted information for nearly 60,000 terpenes. For these molecules, we conducted a subclass-by-subclass analysis in which we highlight chemical and physical properties relevant to several fields, such as natural products chemistry, medicinal chemistry and drug discovery. We were also interested in assessing the potential of this data for clustering and classification tasks. For clustering, we applied and compared k-means and agglomerative clustering, both on the original data and after a dimensionality reduction step. To this end, PCA, FastICA, Kernel PCA, t-SNE and UMAP were used and benchmarked. We also employed a number of methods to classify terpene subclasses from their physico-chemical descriptors: light gradient boosting machine, k-nearest neighbors, random forests, Gaussian naive Bayes and multilayer perceptron. The best-performing algorithms yielded accuracy, F1 score, precision and other metrics all above 0.9, showing the capability of these approaches for the classification of terpene subclasses.
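The clustering and classification workflow summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn is available, uses synthetic data from `make_blobs` as a stand-in for physico-chemical descriptors (it is not COCONUT data), and the choices of three groups, two PCA components and five neighbors are hypothetical settings picked only for the example.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, adjusted_rand_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a descriptor table (assumption): 600 "molecules",
# 8 descriptors, 3 groups playing the role of terpene subclasses.
X, y = make_blobs(n_samples=600, n_features=8, centers=3,
                  cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Dimensionality reduction step (PCA here; the abstract also lists FastICA,
# Kernel PCA, t-SNE and UMAP).
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)


def cluster_scores(data):
    """Compare k-means and agglomerative clustering against the known labels."""
    km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    ag_labels = AgglomerativeClustering(n_clusters=3).fit_predict(data)
    return {
        "kmeans": adjusted_rand_score(y, km_labels),
        "agglomerative": adjusted_rand_score(y, ag_labels),
    }


ari_original = cluster_scores(X)    # clustering on the original descriptors
ari_reduced = cluster_scores(X_2d)  # clustering after dimensionality reduction

# Classification of the groups from the descriptors with k-nearest neighbors,
# one of the classifier families named in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")

print("ARI (original):", ari_original)
print("ARI (PCA 2D):  ", ari_reduced)
print("kNN accuracy:", accuracy, "macro F1:", macro_f1)
```

The adjusted Rand index is used here only because the synthetic data comes with known group labels; it is one possible way to compare clusterings and is not necessarily the metric used in the paper.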


Related content

PCA


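In statistics, principal component analysis (PCA) is a method that projects data from a higher-dimensional space into a lower-dimensional one by maximizing the variance captured along each new dimension. Given a set of points in two, three or more dimensions, a "best-fit" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fit line is chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components.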
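The construction above can be illustrated with a short NumPy sketch. It assumes the standard eigendecomposition route (principal components as eigenvectors of the sample covariance matrix); the function name `pca` and the synthetic 2-D data are hypothetical and chosen only for the example.

```python
import numpy as np


def pca(X, n_components):
    """Return (components, scores, variances) from the covariance eigendecomposition."""
    X_centered = X - X.mean(axis=0)           # center each feature at zero
    cov = np.cov(X_centered, rowvar=False)    # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    components = eigvecs[:, order].T          # rows are principal components
    scores = X_centered @ components.T        # data expressed in the new basis
    return components, scores, eigvals[order]


# Hypothetical data: points stretched along the direction (1, 1) plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))

components, scores, variances = pca(X, n_components=2)

print("first principal component:", components[0])
print("variance along each component:", variances)
print("covariance of projected data:\n", np.cov(scores, rowvar=False))
```

The first component should point (up to sign) along the stretched direction, and the off-diagonal entries of the projected covariance should be numerically zero, which is the "uncorrelated dimensions" property described above.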
