Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS

Deep learning models that leverage large datasets are often the state of the art for modelling molecular properties. When the datasets are smaller (< 2000 molecules), it is not clear that deep learning approaches are the right modelling tool. In this work we perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets. Using different molecular representations and models, we analyse the quality of their predictions and uncertainties across a variety of tasks (binary classification, regression) and datasets. We also introduce two simulated experiments that evaluate their performance: (1) molecular design guided by Bayesian optimization, and (2) inference on out-of-distribution data via ablated cluster splits. We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments. We have packaged our analysis into the DIONYSUS repository, which is open source to aid reproducibility and extension to new datasets.
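To make the notion of calibration concrete, the following is a minimal sketch of one common calibration metric for binary tasks, the expected calibration error (ECE). It is an illustration only: the abstract does not state which calibration metrics DIONYSUS uses, and the function name `expected_calibration_error` and its parameters are hypothetical, not taken from the repository.

```python
import numpy as np


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Hypothetical helper: ECE for a binary classifier.

    Predictions are grouped into equal-width probability bins. For each bin,
    the gap between the mean predicted probability and the observed fraction
    of positives is weighted by the share of samples in that bin.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)

    # Interior bin edges; np.digitize then returns a bin index in 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(p_pred, edges[1:-1]), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(p_pred[mask].mean() - y_true[mask].mean())
        ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return float(ece)


if __name__ == "__main__":
    # Toy labels and predicted probabilities, invented for illustration.
    labels = [1, 1, 0, 0]
    overconfident = [0.9, 0.9, 0.9, 0.9]
    print(expected_calibration_error(labels, overconfident))
```

A well-calibrated model yields a value near zero; on small datasets the estimate is noisy and sensitive to `n_bins`, so fewer bins may be preferable when only a few hundred molecules are available.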

