We propose a new method for variable selection with operator-induced structure (OIS), in which the predictors are engineered from a limited number of primary variables and a set of elementary algebraic operators through compositions. Standard practice directly analyzes the high-dimensional candidate predictor space in a linear model; statistical analyses are then substantially hampered by the daunting challenge posed by millions of correlated predictors with limited sample size. The proposed method iterates nonparametric variable selection to achieve effective dimension reduction in linear models by utilizing the geometry embedded in OIS. This enables variable selection based on \textit{ab initio} primary variables, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. The proposed method is well suited for areas that adopt feature engineering and emphasize interpretability in addition to prediction, such as the emerging field of materials informatics. We demonstrate the superior performance of the proposed method in simulation studies and a real data application to single-atom catalyst analysis. An OIS screening property for variable selection methods in the presence of feature engineering is introduced; interestingly, finite sample assessment indicates that the employed Bayesian Additive Regression Trees (BART)-based variable selection method enjoys this property. Our numerical experiments show that the proposed method exhibits robust performance when the dimension of engineered features is out of reach of existing methods.
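To make the scale of the engineered predictor space concrete, the following minimal sketch composes a small operator set over a few primary variables. The operator set, the helper name `apply_operators`, and the toy data are illustrative assumptions and are not taken from the paper; the sketch only shows how candidates are generated and does not implement the proposed selection procedure or BART.

```python
import itertools

# Hypothetical operator set chosen for illustration only.
UNARY = {"sq": lambda a: a * a, "abs": abs}
COMMUTATIVE = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
ORDERED = {"-": lambda a, b: a - b}


def apply_operators(features):
    """Apply each operator once to {name: column} and return the enlarged dict."""
    out = dict(features)  # the inputs themselves remain candidate predictors
    for name, col in features.items():
        for op, fn in UNARY.items():
            out[f"{op}({name})"] = [fn(v) for v in col]
    items = list(features.items())
    # Commutative operators: one feature per unordered pair.
    for (n1, c1), (n2, c2) in itertools.combinations(items, 2):
        for op, fn in COMMUTATIVE.items():
            out[f"({n1}{op}{n2})"] = [fn(a, b) for a, b in zip(c1, c2)]
    # Non-commutative operators: one feature per ordered pair.
    for (n1, c1), (n2, c2) in itertools.permutations(items, 2):
        for op, fn in ORDERED.items():
            out[f"({n1}{op}{n2})"] = [fn(a, b) for a, b in zip(c1, c2)]
    return out


# Toy data: three primary variables observed on four samples.
primary = {
    "x1": [0.5, 1.0, 1.5, 2.0],
    "x2": [2.0, 1.0, 0.5, 0.25],
    "x3": [-1.0, 0.0, 1.0, 2.0],
}

once = apply_operators(primary)   # one round of composition
twice = apply_operators(once)     # two rounds of composition
print(len(primary), len(once), len(twice))
```

Even with three primary variables and five operators, two rounds of composition produce hundreds of strongly correlated candidates, which is the growth that motivates selecting on the primary variables rather than on the full engineered space.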