In this work, we propose a semi-supervised triply robust inductive transfer learning (STRIFLE) approach, which integrates heterogeneous data from a label-rich source population and a label-scarce target population to improve learning accuracy in the target population. Specifically, we consider a high-dimensional covariate shift setting and employ two nuisance models, a density ratio model and an imputation model, to combine transfer learning and surrogate-assisted semi-supervised learning strategies and achieve triple robustness. While the STRIFLE approach requires the target and source populations to share the same conditional distribution of the outcome Y given both the surrogate features S and the predictors X, it allows the true underlying model of Y|X to differ between the two populations because of potential covariate shift in S and X. Going beyond double robustness, even if both nuisance models are misspecified or the distribution of Y|S,X differs between the two populations, the triply robust STRIFLE estimator can still partially utilize the source population when the transferred source population and the target population share enough similarities, and it is guaranteed to be no worse than the target-only surrogate-assisted semi-supervised estimator, up to negligible errors. These properties are established theoretically and verified in finite samples via extensive simulation studies. We apply the STRIFLE estimator to train a Type II diabetes polygenic risk prediction model for the African American target population by transferring knowledge from electronic health record-linked genomic data observed in a larger European source population.
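As background for the contrast with double robustness drawn above, the following is a minimal sketch of a textbook doubly robust estimate of a target-population mean under covariate shift. It is not the STRIFLE estimator: it is low-dimensional, uses no surrogate features S, and carries no triple-robustness guarantee. The function name `dr_target_mean`, the Gaussian simulation, and the particular "wrong" nuisance models are all illustrative assumptions, not taken from the paper.

```python
import numpy as np


def dr_target_mean(x_src, y_src, x_tgt, weight_fn, impute_fn):
    """Augmented (doubly robust style) estimate of E_target[Y].

    weight_fn : candidate density ratio p_target(x) / p_source(x)
    impute_fn : candidate imputation model for E[Y | X = x]
    Both are hypothetical plug-ins chosen by the caller.
    """
    w = weight_fn(x_src)
    w = w / w.mean()  # self-normalise the weights on the source sample
    residual = y_src - impute_fn(x_src)
    # Imputed mean on unlabeled target data plus a weighted residual correction
    return impute_fn(x_tgt).mean() + np.mean(w * residual)


# Simulated covariate shift (assumed setup): source X ~ N(0, 1), target X ~ N(1, 1),
# with the same outcome model Y = 1 + 2X + noise in both populations.
rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, 20000)
x_tgt = rng.normal(1.0, 1.0, 20000)
y_src = 1.0 + 2.0 * x_src + rng.normal(0.0, 1.0, x_src.size)
true_target_mean = 3.0  # E[1 + 2X] when X ~ N(1, 1)


def true_ratio(x):
    # Exact density ratio N(1, 1) / N(0, 1)
    return np.exp(x - 0.5)


def wrong_ratio(x):
    # Misspecified density ratio: ignores the shift entirely
    return np.ones_like(x)


slope, intercept = np.polyfit(x_src, y_src, 1)


def good_impute(x):
    # Correctly specified linear imputation fitted on source labels
    return intercept + slope * x


def wrong_impute(x):
    # Misspecified imputation: a constant equal to the source mean of Y
    return np.full_like(x, y_src.mean())


est_ratio_only = dr_target_mean(x_src, y_src, x_tgt, true_ratio, wrong_impute)
est_impute_only = dr_target_mean(x_src, y_src, x_tgt, wrong_ratio, good_impute)
est_both_wrong = dr_target_mean(x_src, y_src, x_tgt, wrong_ratio, wrong_impute)

print(est_ratio_only, est_impute_only, est_both_wrong)
```

In this toy setup the estimate stays close to the true target mean when either nuisance model is correct and drifts away when both are wrong. That last case is the failure mode the abstract says STRIFLE guards against, by falling back to performance no worse than a target-only estimator.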