Including pairwise interactions between the predictors of a regression model can improve its predictive performance. However, fitting such interaction models to typical data sets in biology and other fields often requires solving enormous variable selection problems involving billions of interactions. Problems of this scale demand methods that are computationally cheap in both time and memory yet retain sound statistical properties. Motivated by these large-scale problems, we adopt a very simple guiding principle: one should prefer main effects over interactions, all else being equal. This "reluctance" toward interactions, while reminiscent of the hierarchy principle for interactions, is much less restrictive. We design a computationally efficient method built on this principle and provide theoretical results indicating favorable statistical properties. Empirical results show dramatic computational improvement without sacrificing statistical properties. For example, the proposed method can solve a problem with 10 billion interactions, including 5-fold cross-validation, in under 7 hours on a single CPU.
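As a rough illustration of the guiding principle only, the sketch below first lets main effects explain as much of the response as they can, then ranks pairwise interactions by how strongly they correlate with what is left over. It is not the method proposed in the paper: the helper names `n_pairwise_interactions` and `reluctant_screen`, the ordinary least-squares first step, and the exhaustive loop over pairs are all assumptions made for this example, and the sketch presumes more observations than predictors.

```python
import numpy as np
from itertools import combinations


def n_pairwise_interactions(p):
    """Number of distinct products x_j * x_k with j < k among p predictors."""
    return p * (p - 1) // 2


def reluctant_screen(X, y, k):
    """Hypothetical two-step screen illustrating "reluctance" to interactions.

    Step 1: fit main effects only (least squares with an intercept).
    Step 2: rank each pairwise interaction by its absolute correlation
            with the residual from step 1 and return the top k index pairs.
    Assumes n > p so that the least-squares fit does not interpolate y.
    """
    n, p = X.shape
    # Step 1: main effects get the first chance to explain y.
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef  # mean zero, because A contains an intercept
    resid_norm = np.linalg.norm(resid)

    # Step 2: an interaction is scored only on what main effects left behind.
    scores = []
    for j, l in combinations(range(p), 2):
        z = X[:, j] * X[:, l]
        zc = z - z.mean()
        denom = np.linalg.norm(zc) * resid_norm
        score = abs(zc @ resid) / denom if denom > 0 else 0.0
        scores.append((score, (j, l)))
    scores.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scores[:k]]
```

The counting helper also gives a sense of scale: roughly 141,000 predictors are enough to produce 10 billion pairwise interactions, so an exhaustive Python loop like the one above is only practical for small `p`. A method aimed at that scale would need a penalized first step and a far more efficient screening pass.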