Variable selection is central to high-dimensional data analysis, and many algorithms have been developed for it. Ideally, a variable selection algorithm should be flexible, scalable, and backed by theoretical guarantees, yet most existing algorithms cannot attain all three properties at once. In this article, a three-step variable selection algorithm is developed, consisting of kernel-based estimation of the regression function, estimation of its gradient functions, and a hard-thresholding step. Its key advantages are that it requires no explicit model assumption, admits general predictor effects, allows for scalable computation, and attains the desired asymptotic sparsistency. The proposed algorithm can be adapted to any reproducing kernel Hilbert space (RKHS) with different kernel functions, and can be extended to interaction selection with slight modification. Its computational cost is only linear in the data dimension and can be further reduced through parallel computing. The sparsistency of the proposed algorithm is established for general RKHSs under mild conditions, covering linear and Gaussian kernels as special cases. Its effectiveness is also supported by a variety of simulated and real examples.
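To make the three steps concrete, the following is a minimal sketch in Python, assuming a Gaussian kernel and kernel ridge regression as the RKHS estimator; the median-distance bandwidth heuristic, the regularization parameter lam, and the relative threshold tau are illustrative choices, not the tuned procedure from the article.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    # Pairwise Gaussian kernel: K[i, j] = exp(-||X_i - Z_j||^2 / (2 * sigma^2)).
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def select_variables(X, y, sigma=None, lam=1e-2, tau=0.1):
    """Three steps: (1) kernel ridge fit, (2) empirical gradient norms, (3) hard threshold."""
    n, p = X.shape
    if sigma is None:
        # Common median-pairwise-distance bandwidth heuristic (an illustrative default).
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2)
    K = gaussian_kernel(X, X, sigma)
    # Step 1: kernel ridge estimate f_hat(x) = sum_i alpha_i K(x_i, x).
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    # Step 2: empirical L2 norm of each partial derivative of f_hat,
    # using d/dx_j K(x_i, x) = -((x_j - x_ij) / sigma^2) * K(x_i, x).
    grad_norms = np.empty(p)
    for j in range(p):  # one pass per dimension: cost linear in p, trivially parallelizable
        diff = X[:, j][:, None] - X[:, j][None, :]   # diff[k, i] = x_kj - x_ij
        grad_j = -(diff * K) @ alpha / sigma ** 2    # partial derivative of f_hat at each sample
        grad_norms[j] = np.sqrt(np.mean(grad_j ** 2))
    # Step 3: hard-threshold the gradient norms to select variables.
    return np.flatnonzero(grad_norms > tau * grad_norms.max())

# Toy usage: y depends only on the first two of ten predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
print(select_variables(X, y))  # ideally flags variables 0 and 1
```

Variables whose gradient functions are identically zero do not affect the regression function, which is why thresholding the empirical gradient norms yields a model-free selection rule.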