In the emerging field of materials informatics, a fundamental task is to identify physicochemically meaningful descriptors, or materials genes, which are engineered from primary variables by composing a set of elementary algebraic operators. Standard practice analyzes the high-dimensional space of candidate predictors directly in a linear model; statistical analysis is then severely hampered by an astronomically large number of correlated predictors combined with a limited sample size. We formulate this problem as variable selection with operator-induced structure (OIS), and propose a new method that achieves unconventional dimension reduction by exploiting the geometry embedded in OIS. Although the model remains linear, we iterate nonparametric variable selection for effective dimension reduction. This enables variable selection based on ab initio primary variables, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. We introduce an OIS screening property for variable selection with OIS; interestingly, finite-sample assessment indicates that the employed variable selection method, based on Bayesian Additive Regression Trees (BART), enjoys this property under the simulation settings. Numerical studies show the superiority of the proposed method, which continues to exhibit robust performance when the dimension of the engineered features is out of reach of existing methods. Our analysis of single-atom catalysis identifies physical descriptors that explain the binding energy of metal-support pairs with high explanatory power, leading to interpretable insights that can guide the prevention of a notorious problem called sintering and aid catalyst design.
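To make the scale of the candidate predictor space concrete, the following is a minimal sketch of how engineered features can be built by composing operators on primary variables. The operator set (`sq`, `abs`, `add`, `mul`), the helper `apply_operators`, and the simulated data are all assumptions chosen for illustration; they are not the operator set or the procedure used in the paper.

```python
import itertools
import numpy as np

# Hypothetical operator set, chosen only for illustration.
UNARY = {
    "sq": lambda x: x ** 2,
    "abs": np.abs,
}
BINARY = {
    "add": lambda a, b: a + b,  # commutative, so unordered pairs suffice
    "mul": lambda a, b: a * b,  # commutative, so unordered pairs suffice
}


def apply_operators(features):
    """Apply each operator once to a dict {name: column}.

    Returns a new dict holding the input features plus every feature
    obtained by one further operator application.
    """
    out = dict(features)
    for op, fn in UNARY.items():
        for name, col in features.items():
            out[f"{op}({name})"] = fn(col)
    for op, fn in BINARY.items():
        for (n1, c1), (n2, c2) in itertools.combinations(features.items(), 2):
            out[f"{op}({n1},{n2})"] = fn(c1, c2)
    return out


# Simulated primary variables: n samples, p variables (assumed sizes).
rng = np.random.default_rng(0)
n, p = 20, 5
primary = {f"x{j}": rng.normal(size=n) for j in range(p)}

level1 = apply_operators(primary)  # one round of composition
level2 = apply_operators(level1)   # two rounds of composition
sizes = [len(primary), len(level1), len(level2)]
print(sizes)
```

With only 5 primary variables and 4 operators, one round of composition already yields 35 candidate features, and a second round yields more than a thousand, far exceeding the 20 samples. Many of these columns are also strongly correlated by construction, which is the difficulty the abstract describes for direct analysis of the engineered feature space.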