This paper develops a new framework, called modular regression, for utilizing auxiliary information -- such as variables other than the original features or additional data sets -- in the training of linear models. At a high level, our method follows a three-step routine: (i) decompose the regression task into several sub-tasks, (ii) fit the sub-task models, and (iii) combine the sub-task models into an improved estimate for the original regression problem. This routine applies to widely used low-dimensional (generalized) linear models and to high-dimensional regularized linear regression. It also extends naturally to missing-data settings where only partial observations are available. By incorporating auxiliary information, our approach improves estimation efficiency and prediction accuracy over linear regression or the Lasso, under a conditional independence assumption. For high-dimensional settings, we develop an extension of our procedure that is robust to violations of the conditional independence assumption, in the sense that it improves efficiency when the assumption holds and coincides with the Lasso otherwise. We demonstrate the efficacy of our methods on both simulated and real data sets.
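The three-step routine above can be illustrated with a toy example. This is a minimal sketch, not the paper's actual estimator: it assumes a single auxiliary variable `Z` through which `Y` depends on `X` (a conditional-independence structure, `Y` independent of `X` given `Z`), fits the two sub-task regressions by ordinary least squares, and composes them into a fit for the original `Y` on `X` problem. All variable names and the simulated data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data in which Y depends on X only through the
# auxiliary variable Z (conditional-independence structure).
n = 2000
X = rng.normal(size=(n, 3))
Z = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
Y = 3.0 * Z + rng.normal(size=n)

def ols(A, b):
    """Least-squares coefficients, with an intercept prepended."""
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, b, rcond=None)
    return coef

# Steps (i)+(ii): decompose into sub-tasks and fit them:
# regress Z on X, then Y on Z.
gamma = ols(X, Z)             # sub-task 1: Z ~ X
beta_z = ols(Z[:, None], Y)   # sub-task 2: Y ~ Z

# Step (iii): compose the sub-models into a fit of Y on X
# by plugging the fitted Z ~ X model into the Y ~ Z model.
intercept = beta_z[0] + beta_z[1] * gamma[0]
slope = beta_z[1] * gamma[1:]

# Compare with a direct OLS fit of Y on X; both should recover
# the composed effect 3 * (1, -2, 0.5) = (3, -6, 1.5).
direct = ols(X, Y)
print(np.round(slope, 2), np.round(direct[1:], 2))
```

In this toy setting the modular and direct fits estimate the same target; the paper's point is that, under the conditional independence assumption, the modular route can use auxiliary observations of `Z` (e.g., extra data sets with `Z` but not `Y`) to estimate the sub-task models more accurately than a single direct regression allows.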