In biomedical research, leveraging information from multiple similar source studies has proven useful for obtaining more accurate predictions in a target study. However, in many biomedical applications based on real-world data, the populations under consideration in different studies (e.g., clinical sites) can be heterogeneous, which makes it challenging to borrow information properly for the target study. State-of-the-art methods typically rely on study-level matching to identify source studies that are similar to the target study; samples from source studies that differ significantly from the target study are all dropped at the study level, which can lead to substantial loss of information. We consider a general setting in which all studies are sampled from a super-population composed of distinct subpopulations, and we propose a novel framework of targeted learning via subpopulation matching. In contrast to existing study-level matching methods, measuring similarities between subpopulations can effectively decompose both within- and between-study heterogeneity, allowing information from all source studies to be incorporated without dropping any samples. We formulate the proposed framework as a two-step procedure: a finite mixture model is first fitted jointly across all studies to provide subject-wise probabilistic subpopulation information, and information is then transferred within each identified subpopulation from the source studies to the target study. We establish non-asymptotic properties of our estimator and demonstrate, via simulation studies, that our method improves prediction at the target study.
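The two-step procedure described above can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a scalar covariate, a two-component Gaussian mixture fitted on the covariate alone, a linear outcome model within each subpopulation, and simple pooling of source and target samples as the transfer step. All function names, the simulated data, and the parameter values are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(xs, pis, mus, sds):
    """Subject-wise posterior probabilities of subpopulation membership."""
    out = []
    for x in xs:
        dens = [p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sds)]
        total = sum(dens) or 1e-300
        out.append([d / total for d in dens])
    return out

def fit_mixture(xs, k=2, iters=200):
    """Step 1: EM for a 1-D Gaussian mixture on covariates pooled over all studies."""
    srt = sorted(xs)
    mus = [srt[int((j + 0.5) * len(xs) / k)] for j in range(k)]  # quantile init
    sds = [(srt[-1] - srt[0]) / (2 * k) or 1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        resp = posterior(xs, pis, mus, sds)
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = math.sqrt(max(var, 1e-6))
            pis[j] = nj / len(xs)
    return pis, mus, sds

def weighted_line(xs, ys, ws):
    """Weighted least squares for y = a + b * x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate(n, frac_sub0, rng):
    """Hypothetical study: a mix of two subpopulations with different outcome models."""
    xs, ys = [], []
    for _ in range(n):
        if rng.random() < frac_sub0:
            x = rng.gauss(-3.0, 1.0)
            y = 1.0 + 2.0 * x + rng.gauss(0.0, 0.5)
        else:
            x = rng.gauss(3.0, 1.0)
            y = -1.0 - 1.0 * x + rng.gauss(0.0, 0.5)
        xs.append(x)
        ys.append(y)
    return xs, ys

rng = random.Random(0)
src_a = simulate(300, 0.9, rng)   # source study dominated by subpopulation 0
src_b = simulate(300, 0.1, rng)   # source study dominated by subpopulation 1
target = simulate(20, 0.5, rng)   # small target study with both subpopulations

pool_x = src_a[0] + src_b[0] + target[0]
pool_y = src_a[1] + src_b[1] + target[1]

# Step 1: joint mixture fit across all studies, then subject-wise memberships.
pis, mus, sds = fit_mixture(pool_x)
resp = posterior(pool_x, pis, mus, sds)

# Step 2: within each subpopulation, fit the outcome model on all samples,
# weighted by membership probability, so no source sample is dropped.
lines = [weighted_line(pool_x, pool_y, [r[j] for r in resp]) for j in range(2)]

def predict(x):
    """Target prediction: membership-weighted average of subpopulation models."""
    r = posterior([x], pis, mus, sds)[0]
    return sum(rj * (a + b * x) for rj, (a, b) in zip(r, lines))

print("fitted (intercept, slope) per subpopulation:", lines)
```

Under this setup, study-level matching would have to choose between two source studies that each only partly resemble the target, whereas the membership weights let every source sample contribute to the subpopulation it most likely belongs to.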