Efficient, physically-inspired descriptors of the structure and composition of molecules and materials play a key role in the application of machine-learning techniques to atomistic simulations. The proliferation of approaches, together with the fact that each choice of features can behave very differently depending on how it is used (e.g. when non-linear kernels and non-Euclidean metrics are introduced to manipulate the features), makes it difficult to compare different methods objectively and to address fundamental questions about how one feature space is related to another. In this work we introduce a framework to compare different sets of descriptors, and different ways of transforming them by means of metrics and kernels, in terms of the structure of the feature space that they induce. We define diagnostic tools to determine whether alternative feature spaces contain equivalent amounts of information, and whether the common information is substantially distorted when going from one feature space to another. We compare, in particular, representations that are built in terms of n-body correlations of the atom density, quantitatively assessing the information loss associated with the use of low-order features. We also investigate the impact of different choices of basis functions and hyperparameters of the widely used SOAP and Behler-Parrinello features, and investigate how the use of non-linear kernels, and of a Wasserstein-type metric, changes the structure of the feature space in comparison to a simpler linear feature space.
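The idea of asking whether one feature space contains the information held in another can be illustrated with a small sketch. The abstract does not define the diagnostic tools, so everything below is an assumption made for illustration: the helper name `reconstruction_error`, the global standardization, the ridge regularization, and the train/test split are all hypothetical choices. The sketch fits a linear map from one feature matrix to another and reports the held-out residual; a small error in one direction and a large error in the other would suggest that the first feature space is the more informative of the two.

```python
import numpy as np


def _standardize(train, test):
    # Center with the training mean and rescale so that the total
    # variance of the training features is one (an assumed convention).
    mean = train.mean(axis=0)
    scale = np.sqrt(((train - mean) ** 2).sum() / train.shape[0])
    return (train - mean) / scale, (test - mean) / scale


def reconstruction_error(X_src, X_tgt, train_fraction=0.5, reg=1e-8, seed=0):
    """Held-out error of linearly reconstructing X_tgt from X_src.

    Hypothetical helper for illustration only; not taken from the source.
    """
    rng = np.random.default_rng(seed)
    n_samples = X_src.shape[0]
    perm = rng.permutation(n_samples)
    n_train = int(train_fraction * n_samples)
    train, test = perm[:n_train], perm[n_train:]

    A_train, A_test = _standardize(X_src[train], X_src[test])
    B_train, B_test = _standardize(X_tgt[train], X_tgt[test])

    # Ridge-regularized least squares: P = (A^T A + reg * I)^{-1} A^T B
    gram = A_train.T @ A_train + reg * np.eye(A_train.shape[1])
    P = np.linalg.solve(gram, A_train.T @ B_train)

    residual = B_test - A_test @ P
    return np.sqrt((residual ** 2).sum() / len(test))


# Toy data: Y keeps only the first two columns of X, so Y can be
# rebuilt from X, but X cannot be fully rebuilt from Y.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
Y = X[:, :2]

err_x_to_y = reconstruction_error(X, Y)
err_y_to_x = reconstruction_error(Y, X)
print(f"X -> Y: {err_x_to_y:.3e}")
print(f"Y -> X: {err_y_to_x:.3e}")
```

The measure is deliberately asymmetric: comparing both directions is what distinguishes "equivalent information" from "one space is a lossy version of the other". Replacing the linear map with a kernel model would be one way to probe non-linear relationships, again as an assumption rather than the method of the paper.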