The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly by the choice of basis. Considerable effort has been dedicated to the optimization of the basis set, typically driven by heuristic considerations of the behavior of the regression target. Here we take a different, unsupervised viewpoint, aiming to determine the basis that encodes in the most compact way possible the structural information that is relevant for the dataset at hand. For each training dataset and number of basis functions, one can determine a unique basis that is optimal in this sense, and that can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, particularly when building representations that correspond to high body-order correlations. We present examples that involve both molecular and condensed-phase machine-learning models.
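To make the idea of a data-driven contraction concrete, the following is a minimal sketch of one way such an optimal basis could be obtained: the covariance of the expansion coefficients on a primitive radial basis is diagonalized over the training set, the leading eigenvectors define the contracted basis, and the resulting radial functions are tabulated with cubic splines so they can be evaluated at the same cost as the primitive ones. This is not the authors' implementation; the array names (coeffs, R_prim, r_grid) and the function optimal_radial_basis are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Assumed (hypothetical) inputs:
#   coeffs : (n_environments, n_primitive) expansion coefficients of the atom
#            density on a primitive radial basis, for one angular/species channel
#   R_prim : (n_primitive, n_grid) primitive radial functions tabulated on r_grid
#   r_grid : (n_grid,) radial grid, strictly increasing

def optimal_radial_basis(coeffs, R_prim, r_grid, n_optimal):
    """Sketch of an unsupervised, dataset-specific basis contraction.

    Diagonalizes the covariance of the expansion coefficients and keeps the
    leading eigenvectors, so that n_optimal contracted functions capture as
    much of the dataset's variance as possible.  The contracted radial
    functions are then approximated with cubic splines.
    """
    # Covariance of the coefficients over all environments in the dataset
    cov = coeffs.T @ coeffs / coeffs.shape[0]

    # eigh returns eigenvalues in ascending order; keep the largest n_optimal
    eigval, eigvec = np.linalg.eigh(cov)
    U = eigvec[:, ::-1][:, :n_optimal]      # (n_primitive, n_optimal) contraction

    # Contracted radial functions on the grid, one spline per optimal function
    R_opt = U.T @ R_prim                    # (n_optimal, n_grid)
    splines = [CubicSpline(r_grid, R_opt[q]) for q in range(n_optimal)]
    return U, splines

# Usage sketch: project coefficients of new structures onto the optimal basis
# c_opt = coeffs_new @ U
```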