Terpenes are a widespread class of natural products with significant chemical and biological diversity, and many of these molecules have already made their way into medicines. Given the thousands of molecules already described, fully characterizing this chemical space can be challenging when relying on classical approaches. In this work, we employ a data science-based approach to systematically identify, compile and characterize the diversity of currently known terpenes. We worked with a natural product database, COCONUT, from which we extracted information for nearly 60,000 terpenes. For these molecules, we conducted a subclass-by-subclass analysis in which we highlight chemical and physical properties relevant to fields such as natural products chemistry, medicinal chemistry and drug discovery. We were also interested in assessing the potential of these data for clustering and classification tasks. For clustering, we applied and compared k-means and agglomerative clustering, both on the original data and after a dimensionality reduction step. To this end, PCA, FastICA, Kernel PCA, t-SNE and UMAP were used and benchmarked. We also employed several methods to classify terpene subclasses from their physico-chemical descriptors: light gradient boosting machine, k-nearest neighbors, random forests, Gaussian naive Bayes and multilayer perceptron. The best-performing algorithms yielded accuracy, F1 score, precision and other metrics all above 0.9, demonstrating the capability of these approaches for classifying terpene subclasses.
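The clustering and classification workflow outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn, uses synthetic blobs as a stand-in for the COCONUT descriptor table, picks PCA and random forests as single representatives of the listed methods, and scores clusterings with the adjusted Rand index, which the abstract does not specify. All variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, adjusted_rand_score, f1_score,
                             precision_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 600 "molecules", 8 "descriptors",
# 3 well-separated "subclasses". Real descriptors would come from the database.
centers = np.array([[0.0] * 8, [6.0] * 8, [-6.0] * 8])
X, y = make_blobs(n_samples=600, centers=centers, cluster_std=1.0,
                  random_state=0)

# Standardize descriptors, then reduce to two dimensions with PCA.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)


def cluster_scores(features):
    """Compare k-means and agglomerative clustering against known labels."""
    km_labels = KMeans(n_clusters=3, n_init=10,
                       random_state=0).fit_predict(features)
    ag_labels = AgglomerativeClustering(n_clusters=3).fit_predict(features)
    return {
        "kmeans": adjusted_rand_score(y, km_labels),
        "agglomerative": adjusted_rand_score(y, ag_labels),
    }


# Cluster on the original (scaled) descriptors and on the reduced ones.
ari_original = cluster_scores(X_scaled)
ari_reduced = cluster_scores(X_pca)

# Supervised classification of subclasses from descriptors.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1_macro": f1_score(y_test, y_pred, average="macro"),
    "precision_macro": precision_score(y_test, y_pred, average="macro"),
}

print("ARI, original descriptors:", ari_original)
print("ARI, PCA-reduced:", ari_reduced)
print("Classification metrics:", metrics)
```

Macro averaging is used here so that each subclass counts equally regardless of its size; whether the reported metrics were averaged this way is not stated in the abstract.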