The input of almost every machine-learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly by the choice of basis. Here we discuss how to build an adaptive, optimal numerical basis, chosen to represent most efficiently the structural diversity of the dataset at hand. For each training dataset this optimal basis is unique, and it can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, presenting examples that involve both molecular and condensed-phase machine-learning models.
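As a rough illustration of this construction (a minimal sketch under assumed inputs, not the reference implementation), the Python snippet below builds a data-driven radial basis in three steps: collect the density-expansion coefficients of the training environments in some primitive basis, diagonalize their covariance so that the leading eigenvectors define the contraction that captures most of the structural variance of the dataset, and tabulate the contracted radial functions with cubic splines so that evaluating the optimal basis costs no more than evaluating a splined primitive basis. All names here (primitive_radial_basis, the grid sizes, the random stand-in coefficients) are hypothetical placeholders rather than an actual library API, and in the full method a separate contraction would typically be computed for each angular channel and chemical species.

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical input: density-expansion coefficients of all training
# environments in a primitive radial basis of size n_max
# (shape: n_environments x n_max); random numbers stand in for real data.
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((5000, 12))

# 1) covariance of the expansion coefficients over the training set
cov = coeffs.T @ coeffs / coeffs.shape[0]          # (n_max, n_max)

# 2) eigendecomposition: eigenvectors sorted by decreasing variance define
#    the contraction of the primitive basis; truncating to n_opt functions
#    keeps the directions that best describe the dataset's structural diversity
eigval, eigvec = np.linalg.eigh(cov)
order = np.argsort(eigval)[::-1]
n_opt = 6
contraction = eigvec[:, order[:n_opt]]             # (n_max, n_opt)

# 3) tabulate the contracted radial functions and spline them, so the optimal
#    basis is evaluated at the same cost as any splined primitive basis
def primitive_radial_basis(r, n_max=12, cutoff=5.0):
    # placeholder primitive basis: equispaced Gaussians inside the cutoff
    centers = np.linspace(0.0, cutoff, n_max)
    sigma = cutoff / n_max
    return np.exp(-((r[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

r_grid = np.linspace(0.0, 5.0, 200)                # radial grid up to the cutoff
R_primitive = primitive_radial_basis(r_grid)       # (n_grid, n_max)
R_optimal = R_primitive @ contraction              # (n_grid, n_opt)
optimal_basis_spline = CubicSpline(r_grid, R_optimal)

# evaluating the optimal basis at arbitrary distances is now a spline lookup
print(optimal_basis_spline(np.array([0.7, 1.9, 3.3])).shape)   # -> (3, n_opt)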