Bayesian subset selection and variable importance for interpretable prediction and classification

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model $\mathcal{M}$, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single "best" subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via $\mathcal{M}$. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.
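As a rough illustration of the workflow described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes squared error loss, under which the optimal linear coefficients for a subset are the least-squares fit of the posterior predictive mean from $\mathcal{M}$ onto that subset's columns. The acceptance rule here (in-sample predictive loss within a relative `margin` of the best subset) is a simplifying assumption standing in for the paper's out-of-sample criterion, and every function name and parameter is hypothetical.

```python
import itertools
import numpy as np


def optimal_coefs(X, y_hat, subset):
    """Least-squares fit of the predictive mean y_hat on the chosen columns.

    Under squared error loss this minimizes the expected predictive loss
    over linear coefficients restricted to `subset`.
    """
    beta, *_ = np.linalg.lstsq(X[:, subset], y_hat, rcond=None)
    return beta


def acceptable_family(X, pred_draws, max_size=None, margin=0.05):
    """Enumerate subsets and keep those within `margin` of the best loss.

    pred_draws has shape (n_draws, n): posterior predictive draws from M.
    ASSUMPTION: acceptance is judged by in-sample predictive loss, which is
    a simplification chosen for this sketch.
    """
    n, p = X.shape
    y_hat = pred_draws.mean(axis=0)  # posterior predictive mean
    max_size = p if max_size is None else max_size
    results = []
    for k in range(1, max_size + 1):
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            fitted = X[:, cols] @ optimal_coefs(X, y_hat, cols)
            # Monte Carlo estimate of the expected squared error loss
            loss = np.mean((pred_draws - fitted) ** 2)
            results.append((subset, loss))
    best = min(loss for _, loss in results)
    return [s for s, loss in results if loss <= (1 + margin) * best]


def variable_importance(family, p):
    """Share of acceptable subsets in which each variable appears."""
    counts = np.zeros(p)
    for subset in family:
        counts[list(subset)] += 1
    return counts / len(family)
```

In this sketch, a variable with importance 1 appears in every acceptable subset, a variable with importance 0 appears in none, and `min(family, key=len)` gives a smallest acceptable subset. Exhaustive enumeration is only practical for a small number of variables; a larger problem would need a search strategy to propose candidate subsets.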
