Artificial intelligence (AI) has been widely applied in drug discovery, with molecular property prediction as a major task. Despite the boom of AI techniques in molecular representation learning, some key aspects underlying molecular property prediction have not yet been carefully examined. In this study, we conducted a systematic comparison of three representative models, random forest, MolBERT and GROVER, which use three major molecular representations: extended-connectivity fingerprints, SMILES strings and molecular graphs, respectively. Notably, MolBERT and GROVER are pretrained on large-scale unlabelled molecule corpora in a self-supervised manner. In addition to the commonly used MoleculeNet benchmark datasets, we assembled a suite of opioids-related datasets for downstream prediction evaluation. We first profiled the datasets through label distribution and structural analyses; we also examined the activity cliffs issue in the opioids-related datasets. Then, we trained 4,320 predictive models and evaluated the usefulness of the learned representations. Furthermore, we examined model evaluation itself by studying the effect of statistical tests, evaluation metrics and task settings. Finally, we dissected chemical space generalization into inter-scaffold and intra-scaffold generalization and measured prediction performance to evaluate model generalizability under both settings. By taking this respite, we reflected on the key aspects underlying molecular property prediction, awareness of which can, hopefully, lead to better AI techniques in this field.
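The inter-scaffold versus intra-scaffold distinction can be illustrated with a minimal sketch. The abstract does not specify the splitting procedure used in the study, so the following is an assumption-laden illustration only: scaffold keys are assumed to be precomputed per molecule (for example, as Bemis-Murcko scaffold strings), and the function names `group_by_scaffold`, `inter_scaffold_split` and `intra_scaffold_split` are hypothetical.

```python
from collections import defaultdict
import random


def group_by_scaffold(scaffolds):
    """Map each scaffold key to the indices of the molecules that carry it."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    return groups


def inter_scaffold_split(scaffolds, test_fraction=0.2, seed=0):
    """Hold out whole scaffold groups: no test scaffold is seen in training."""
    groups = list(group_by_scaffold(scaffolds).values())
    random.Random(seed).shuffle(groups)
    target = test_fraction * len(scaffolds)
    train, test = [], []
    for members in groups:
        # Assign whole groups to the test set until it reaches the target size.
        if len(test) < target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


def intra_scaffold_split(scaffolds, test_fraction=0.2, seed=0):
    """Split inside each scaffold group: every test scaffold is also in training."""
    rng = random.Random(seed)
    train, test = [], []
    for members in group_by_scaffold(scaffolds).values():
        members = list(members)
        rng.shuffle(members)
        if len(members) > 1:
            # Take at least one test molecule but always leave one for training.
            n_test = max(1, round(test_fraction * len(members)))
            n_test = min(n_test, len(members) - 1)
        else:
            # Singleton scaffolds cannot be split, so they stay in training.
            n_test = 0
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)
```

Under these assumptions, a model scored on the first split is tested on scaffolds it has never seen, while the second split tests it on unseen molecules that share scaffolds with the training set. Both functions return index lists into the original dataset, so the same split can be reused across the fingerprint, SMILES and graph representations.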