Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. Unfortunately, methods that simultaneously express sparsity, nonlinearity, and interactions are computationally intractable -- with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We show that suitable interaction models have a kernel representation, namely there exists a "kernel trick" to perform variable selection and estimation in $O(\#\text{covariates})$ time. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real data sets, our approach outperforms existing methods used for large, high-dimensional data sets while remaining competitive (or being orders of magnitude faster) in runtime.