Uncertainty quantification for predictions of atomistic neural networks

The value of uncertainty quantification for predictions of neural networks (NNs) trained on quantum chemical reference data is quantitatively explored. For this, the architecture of the PhysNet NN was suitably modified, and the resulting model was evaluated with different metrics to quantify calibration, the quality of predictions, and whether prediction error and predicted uncertainty are correlated. Results from training on the QM9 database and evaluating test-set data from within and outside the distribution indicate that error and uncertainty are not linearly related. The results clarify that noise and redundancy complicate property prediction for molecules even when the changes are small, e.g. double bond migration in two otherwise identical molecules. The model was then applied to a real database of tautomerization reactions. Analysis of the distance between members in feature space, combined with other parameters, shows that redundant information in the training dataset can lead to large variances and small errors, whereas the presence of similar but unspecific information returns large errors but small variances. This was observed, e.g., for nitro-containing aliphatic chains, for which predictions were difficult although the training set contained several examples of nitro groups bound to aromatic molecules. This underlines the importance of the composition of the training data and provides chemical insight into how it affects the prediction capabilities of an ML model. Finally, the approach put forward can be used for information-based improvement of chemical databases for target applications through active learning optimization.
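
The abstract does not name the metrics used, so the following is only a minimal sketch of how one might check whether prediction error and predicted uncertainty are linearly related and whether the uncertainty is roughly calibrated. The function name `error_uncertainty_summary`, the three statistics chosen, and the Gaussian reading of the predicted standard deviation `sigma` are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def error_uncertainty_summary(y_true, y_pred, sigma):
    """Summarise how absolute errors relate to predicted standard deviations.

    Hypothetical helper: the statistics below are common choices, not
    necessarily the metrics used in the paper.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    abs_err = np.abs(y_true - y_pred)

    # Pearson correlation between |error| and predicted sigma:
    # a value near 0 means no linear relation between the two.
    pearson_r = float(np.corrcoef(abs_err, sigma)[0, 1])

    # Fraction of errors falling within one predicted sigma.
    # For well-calibrated Gaussian uncertainties this is about 0.68.
    coverage_1sigma = float(np.mean(abs_err <= sigma))

    # Mean squared standardised error; about 1 if sigma is calibrated,
    # above 1 if the model is overconfident, below 1 if underconfident.
    mean_z2 = float(np.mean(((y_true - y_pred) / sigma) ** 2))

    return {
        "pearson_r": pearson_r,
        "coverage_1sigma": coverage_1sigma,
        "mean_z2": mean_z2,
    }


if __name__ == "__main__":
    # Synthetic data only: errors drawn from N(0, sigma^2), so the
    # uncertainties are calibrated by construction.
    rng = np.random.default_rng(0)
    sigma = rng.uniform(0.5, 2.0, size=20_000)
    y_true = np.zeros_like(sigma)
    y_pred = rng.normal(0.0, sigma)
    print(error_uncertainty_summary(y_true, y_pred, sigma))
```

Even in this calibrated synthetic case, the correlation between individual absolute errors and `sigma` stays well below 1, because each error is a single draw from a distribution of width `sigma`. A weak linear correlation is therefore not, on its own, evidence of a poorly calibrated model.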
