Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. Unfortunately, methods that simultaneously express sparsity, nonlinearity, and interactions are computationally intractable -- with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We show that suitable interaction models have a kernel representation; namely, there exists a "kernel trick" to perform variable selection and estimation in $O(\#\text{ covariates})$ time. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real datasets, our approach outperforms existing methods used for large, high-dimensional datasets while remaining competitive (or being orders of magnitude faster) in runtime.
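To illustrate how a pairwise-interaction kernel can be evaluated in time linear in the number of covariates, the sketch below uses a standard algebraic identity: the sum of products over all covariate pairs can be recovered from the sum and the sum of squares of per-covariate kernels. This is a minimal illustration under assumed choices (linear per-covariate base kernels, inclusion weights `kappa` controlling sparsity), not the paper's actual method or implementation.

```python
import numpy as np

def pairwise_interaction_kernel(x, z, kappa):
    """Evaluate a first- plus second-order interaction kernel in O(p) time.

    Assumed per-covariate base kernels are linear, k_j(x, z) = x_j * z_j,
    weighted by inclusion weights kappa_j**2 (setting kappa_j = 0 drops
    covariate j, which is one way sparsity can enter). The identity
        sum_{i<j} k_i * k_j = ((sum_j k_j)**2 - sum_j k_j**2) / 2
    avoids the quadratic loop over covariate pairs.
    """
    kj = (kappa ** 2) * x * z                # p per-covariate kernel values
    s1 = kj.sum()                            # first-order (main-effect) part
    s2 = 0.5 * (s1 ** 2 - (kj ** 2).sum())   # all pairwise interactions
    return s1 + s2
```

A brute-force check against the explicit double sum over pairs confirms the identity; the same trick generalizes to higher-order interactions via Newton's identities on elementary symmetric polynomials.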