Deep learning models that leverage large datasets are often the state of the art for modelling molecular properties. When the datasets are smaller (< 2000 molecules), it is not clear that deep learning approaches are the right modelling tool. In this work we perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets. Using different molecular representations and models, we analyse the quality of their predictions and uncertainties across a variety of tasks (binary classification, regression) and datasets. We also introduce two simulated experiments that evaluate their performance: (1) molecular design guided by Bayesian optimization, and (2) inference on out-of-distribution data via ablated cluster splits. We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments. We have packaged our analysis into the DIONYSUS repository, which is open-sourced to aid reproducibility and extension to new datasets.