High-dimensional prediction with multiple data types needs to account for potentially strong differences in predictive signal between those types. Ridge regression is a simple model for high-dimensional data that has rivaled the predictive performance of many more complex models and learners, and that allows data-type-specific penalties. The main challenge for multi-penalty ridge is to optimize these penalties efficiently in a cross-validation (CV) setting, in particular for GLM and Cox ridge regression, which require an additional estimation loop by iterative weighted least squares (IWLS). Our main contribution is a computationally very efficient formula for the multi-penalty, sample-weighted hat matrix, as used in the IWLS algorithm. As a result, nearly all computations are performed in low-dimensional space, yielding a speed-up of several orders of magnitude. We developed a flexible framework that accommodates multiple types of response, unpenalized covariates, several performance criteria and repeated CV. Extensions to paired and preferential data types are included and illustrated on several cancer genomics survival prediction problems. Moreover, we present similar computational shortcuts for maximum marginal likelihood and Bayesian probit regression. The corresponding R package, multiridge, serves as a versatile standalone tool, but also as a fast benchmark for other, more complex models and multi-view learners.
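To make the low-dimensional idea concrete, the following is a minimal NumPy sketch and not the multiridge implementation. It assumes a plain weighted ridge step with no unpenalized covariates, and all function and variable names are hypothetical. By the Woodbury identity, the weighted hat matrix H = X (XᵀWX + Λ)⁻¹ XᵀW equals ΓW (I + ΓW)⁻¹, where Γ = Σ_b X_b X_bᵀ / λ_b is n × n. Once the block products X_b X_bᵀ are available, changing the penalties only requires n × n operations.

```python
import numpy as np

def hat_matrix_naive(blocks, lambdas, weights):
    """High-dimensional route: H = X (X'WX + Lambda)^{-1} X'W, a p x p solve."""
    X = np.hstack(blocks)
    # One penalty per data block, repeated for every column of that block.
    penalty = np.concatenate(
        [np.full(Xb.shape[1], lam) for Xb, lam in zip(blocks, lambdas)]
    )
    W = np.diag(weights)
    A = X.T @ W @ X + np.diag(penalty)
    return X @ np.linalg.solve(A, X.T @ W)

def hat_matrix_lowdim(blocks, lambdas, weights):
    """Low-dimensional route: H = Gamma W (I + Gamma W)^{-1}, an n x n solve."""
    n = blocks[0].shape[0]
    # Gamma = sum_b X_b X_b' / lambda_b; the products X_b X_b' could be cached
    # and reused for every new set of penalties.
    gamma = sum(Xb @ Xb.T / lam for Xb, lam in zip(blocks, lambdas))
    # Broadcasting scales column j by weights[j], i.e. gamma @ diag(weights).
    GW = gamma * weights
    # (I + GW)^{-1} and GW commute, so this solve equals GW (I + GW)^{-1}.
    return np.linalg.solve(np.eye(n) + GW, GW)

# Simulated example: 20 samples, two data types with 300 and 500 features.
rng = np.random.default_rng(0)
n = 20
blocks = [rng.standard_normal((n, 300)), rng.standard_normal((n, 500))]
lambdas = [5.0, 50.0]                        # one penalty per data type
weights = rng.uniform(0.1, 0.25, size=n)     # IWLS-style sample weights

H_naive = hat_matrix_naive(blocks, lambdas, weights)
H_low = hat_matrix_lowdim(blocks, lambdas, weights)
print("max abs difference:", np.max(np.abs(H_naive - H_low)))
```

In this sketch both routes return the same n × n matrix, but the second never forms or solves a p × p system, which is where the speed-up comes from when p is much larger than n.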