We develop a simple and unified framework for nonlinear variable selection that incorporates uncertainty in the prediction function and is compatible with a wide range of machine learning models (e.g., tree ensembles, kernel methods, and neural networks). In particular, for a learned nonlinear model $f(\mathbf{x})$, we quantify the importance of an input variable $\mathbf{x}^j$ using the integrated partial derivative $\Psi_j = \Vert \frac{\partial}{\partial \mathbf{x}^j} f(\mathbf{x})\Vert^2_{P_\mathcal{X}}$. We then (1) provide a principled approach to quantifying variable selection uncertainty by deriving the posterior distribution of $\Psi_j$, and (2) show that the approach generalizes even to non-differentiable models such as tree ensembles. Rigorous Bayesian nonparametric theorems are derived to guarantee the posterior consistency and asymptotic uncertainty of the proposed approach. Extensive simulations and experiments on healthcare benchmark datasets confirm that the proposed algorithm outperforms existing classic and recent variable selection methods.
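To make the importance measure concrete, the following is a minimal sketch (not the authors' implementation) of a Monte Carlo estimate of $\Psi_j$ for a differentiable model: the squared partial derivative $\big(\frac{\partial}{\partial \mathbf{x}^j} f(\mathbf{x})\big)^2$ is averaged over samples drawn from the covariate distribution $P_\mathcal{X}$, with gradients obtained by automatic differentiation. The function names and the toy model are illustrative assumptions; the paper's treatment of non-differentiable models and of the posterior distribution over $\Psi_j$ is not shown here.

```python
# A minimal sketch, assuming f is differentiable and we have samples from P_X:
# Psi_j = || (d/dx^j) f(x) ||^2_{P_X}  ≈  (1/n) * sum_i (d f(x_i) / d x^j)^2.
import jax
import jax.numpy as jnp

def estimate_psi(f, X):
    """Monte Carlo estimate of Psi_j for every input dimension j.

    f : callable mapping a single input vector of shape (d,) to a scalar prediction.
    X : (n, d) array of samples drawn from the covariate distribution P_X.
    Returns a (d,) array whose j-th entry approximates Psi_j.
    """
    grads = jax.vmap(jax.grad(f))(X)      # (n, d): partial derivatives at each sample
    return jnp.mean(grads ** 2, axis=0)   # average squared derivative per dimension

# Usage on a toy model (hypothetical) where only x^0 and x^1 matter:
f = lambda x: jnp.sin(x[0]) + 2.0 * x[1]
X = jax.random.normal(jax.random.PRNGKey(0), (1000, 5))
print(estimate_psi(f, X))  # entries 0 and 1 dominate; the remaining entries are near 0
```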