Machine-learned interatomic potentials (MLIPs) and force fields (i.e., interaction laws for atoms and molecules) are typically trained on limited datasets that cover only a very small part of the full space of possible input structures. MLIPs are nevertheless capable of making accurate predictions of forces and energies in simulations involving (seemingly) much more complex structures. In this article we propose a framework within which this kind of generalisation can be rigorously understood. As a prototypical example, we apply the framework to the simulation of point defects in a crystalline solid. Here, we demonstrate how the accuracy of the simulation depends explicitly on the size of the training structures, on the kind of observations (e.g., energies, forces, force constants, virials) to which the model has been fitted, and on the fit accuracy. The new theoretical insights we gain partially justify current best practices in the MLIP literature and, in addition, suggest a new approach to the collection of training data and the design of loss functions.