Uncertainty control and scalability to large datasets are the two main obstacles to deploying Gaussian process models in autonomous materials and chemical space exploration pipelines. One way to address both issues is to introduce latent inducing variables and to choose an appropriate approximation of the marginal log-likelihood objective. Here, we show that variational learning of the inducing points in the high-dimensional molecular descriptor space significantly improves both the prediction quality and the uncertainty estimates on test configurations from a sample molecular dynamics dataset. Additionally, we show that the inducing points can learn to represent configurations of molecule types that were not present in the set used to initialize the inducing points. Among the several approximate marginal log-likelihood objectives we evaluated, the predictive log-likelihood provides both predictive quality comparable to that of the exact Gaussian process model and excellent uncertainty control. Finally, we comment on whether a machine learning model makes its predictions by interpolating between molecular configurations in the high-dimensional descriptor space. We show that, contrary to intuition, and even for densely sampled molecular dynamics datasets, most predictions are made in the extrapolation regime.
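To illustrate why the choice of approximate objective matters for uncertainty control, the following is a minimal sketch of the per-point data-fit term in two objectives commonly used for sparse variational Gaussian processes with a Gaussian likelihood: the variational lower bound and the predictive log-likelihood. It assumes the standard closed forms from the sparse GP literature, which may differ in detail from the objectives evaluated here. The function names and the arguments `mu`, `s2`, and `noise` (mean and variance of the approximate posterior over the latent function at one input, and the observation noise variance) are hypothetical, and the Kullback-Leibler regularizer that both full objectives also contain is omitted.

```python
import math


def gaussian_log_pdf(y, mean, var):
    """Log density of a univariate normal N(y | mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def elbo_data_term(y, mu, s2, noise):
    """Expected log-likelihood E_q[log N(y | f, noise)] with q(f) = N(mu, s2).

    The latent variance s2 enters only as a penalty, so the objective
    pushes s2 toward zero wherever data are observed.
    """
    return gaussian_log_pdf(y, mu, noise) - 0.5 * s2 / noise


def pll_data_term(y, mu, s2, noise):
    """Log of the expected likelihood, log E_q[N(y | f, noise)].

    For a Gaussian likelihood this is log N(y | mu, s2 + noise), so the
    latent variance s2 is part of the predictive density being fitted.
    """
    return gaussian_log_pdf(y, mu, s2 + noise)


# Hypothetical values: a target that the predictive mean misses by 0.5.
y, mu, s2, noise = 1.0, 0.5, 0.2, 0.05
elbo_value = elbo_data_term(y, mu, s2, noise)
pll_value = pll_data_term(y, mu, s2, noise)
```

By Jensen's inequality the predictive log-likelihood term is never smaller than the variational term, and the two coincide when `s2` is zero. Because `s2` appears inside the density in the predictive version, a large residual can be explained by a larger predictive variance, which is one plausible reason such an objective yields better-calibrated uncertainties.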