Transfer learning aims to improve the performance of a target model by leveraging data from related source populations. It is especially helpful when target data are insufficient. In this paper, we study how to train a high-dimensional ridge regression model using limited target data together with existing models trained on heterogeneous source populations. We consider a practical setting where only the source model parameters are accessible, not the individual-level source data. In the setting with a single source model, we propose a novel, flexible angle-based transfer learning (angleTL) method, which leverages the concordance between the source and the target model parameters. We show that angleTL unifies several benchmark methods by construction, including the target-only model trained using target data alone, the source model trained using the source data, and the distance-based transfer learning method, which incorporates the source model into the target training by penalizing the difference between the target and source model parameters as measured by the $L_2$ norm. We also provide algorithms that incorporate multiple source models while accounting for the fact that some source models may be more helpful than others. Our high-dimensional asymptotic analysis provides interpretations and insights regarding when a source model can be helpful to the target model, and demonstrates the superiority of angleTL over the benchmark methods. We perform extensive simulation studies to validate our theoretical conclusions and show the feasibility of applying angleTL to transfer existing genetic risk prediction models across multiple biobanks.
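The abstract does not spell out the angleTL objective, so the following is only a minimal sketch of how the three benchmark methods could be viewed as special cases of one estimator. It assumes the concordance with the source parameters is rewarded through an inner-product term added to a ridge loss; the function name `transfer_ridge`, the tuning parameters `lam` and `eta`, and the simulated data are all hypothetical and not taken from the paper.

```python
import numpy as np


def transfer_ridge(X, y, w, lam, eta):
    """Hypothetical sketch: minimise over b

        ||y - X b||^2 / (2n) + (lam / 2) * ||b||^2 - eta * w^T b,

    where w is a given source coefficient vector (assumed form, not
    confirmed by the abstract). Setting the gradient to zero gives
    (X^T X / n + lam * I) b = X^T y / n + eta * w.
    """
    n, p = X.shape
    A = X.T @ X / n + lam * np.eye(p)
    rhs = X.T @ y / n + eta * w
    return np.linalg.solve(A, rhs)


# Simulated high-dimensional target data (p > n) and a related source model.
rng = np.random.default_rng(0)
n, p = 50, 200
beta = rng.standard_normal(p) / np.sqrt(p)            # target coefficients
w = beta + 0.3 * rng.standard_normal(p) / np.sqrt(p)  # source coefficients
X = rng.standard_normal((n, p))
y = X @ beta + 0.5 * rng.standard_normal(n)

lam = 1.0
b_target_only = transfer_ridge(X, y, w, lam, eta=0.0)  # ordinary ridge
b_distance = transfer_ridge(X, y, w, lam, eta=lam)     # L2 distance penalty
b_source_like = transfer_ridge(X, y, w, 1e8, eta=1e8)  # collapses to w

for name, b in [("target-only", b_target_only),
                ("distance-based", b_distance),
                ("source-like", b_source_like)]:
    print(f"{name:15s} estimation error: {np.linalg.norm(b - beta):.3f}")
```

Under this assumed form, `eta = 0` ignores the source model, `eta = lam` is equivalent to penalizing $\frac{\lambda}{2}\|b - w\|_2^2$ (the two objectives differ only by a constant), and letting both grow together returns the source coefficients. Treating `lam` and `eta` as separate tuning parameters is what would give the extra flexibility.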