A critical decision point when training predictors using multiple studies is whether these studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We consider 1) merging all of the datasets and training a single learner, and 2) multi-study ensembling, which involves training a separate learner on each dataset and combining the resulting predictions. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than ensembling when the predictor-outcome relationships are relatively homogeneous across studies. However, as cross-study heterogeneity increases, there exists a transition point beyond which ensembling outperforms merging. We provide analytic expressions for the transition point in various scenarios, study its asymptotic properties, and illustrate, with an application from metabolomics, how transition point theory can be used to decide when studies should be combined.
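The contrast between merging and ensembling can be sketched with a small simulation. This is a minimal illustration under stated assumptions, not the authors' simulation design: the study sizes, the number of predictors, the random-effects model for the study-specific coefficients, the equal-weight ensemble, and the function names `fit_ols` and `simulate` are all hypothetical choices made here. It does not compute the analytic transition point; it only compares the two approaches at a low and a high level of heterogeneity.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients (no intercept, for simplicity)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simulate(sigma_het, study_sizes=(20, 40, 80, 160), p=5,
             n_test=200, n_reps=1000, seed=0):
    """Average test MSE of merging vs. equal-weight ensembling.

    Assumption: each study's coefficients are the shared mean plus
    independent N(0, sigma_het^2) noise, so sigma_het controls the
    cross-study heterogeneity. The test study is drawn the same way.
    """
    rng = np.random.default_rng(seed)
    beta_mean = np.ones(p)
    mse_merged = 0.0
    mse_ensemble = 0.0
    for _ in range(n_reps):
        Xs, ys = [], []
        for n in study_sizes:
            beta_k = beta_mean + sigma_het * rng.standard_normal(p)
            X = rng.standard_normal((n, p))
            y = X @ beta_k + rng.standard_normal(n)
            Xs.append(X)
            ys.append(y)

        # A new, unseen study with its own coefficients.
        beta_test = beta_mean + sigma_het * rng.standard_normal(p)
        X_test = rng.standard_normal((n_test, p))
        y_test = X_test @ beta_test + rng.standard_normal(n_test)

        # 1) Merging: stack all studies and fit a single learner.
        merged_coef = fit_ols(np.vstack(Xs), np.concatenate(ys))
        pred_merged = X_test @ merged_coef

        # 2) Ensembling: one learner per study, predictions averaged
        #    with equal weights (other weightings are possible).
        per_study_preds = [X_test @ fit_ols(X, y) for X, y in zip(Xs, ys)]
        pred_ensemble = np.mean(per_study_preds, axis=0)

        mse_merged += np.mean((y_test - pred_merged) ** 2)
        mse_ensemble += np.mean((y_test - pred_ensemble) ** 2)
    return mse_merged / n_reps, mse_ensemble / n_reps


if __name__ == "__main__":
    for sigma in (0.0, 1.0):
        merged, ensemble = simulate(sigma)
        print(f"sigma_het={sigma}: merged={merged:.3f}, ensemble={ensemble:.3f}")
```

With no heterogeneity, the merged fit uses all observations to estimate one coefficient vector and should have the lower error. With strong heterogeneity and unequal study sizes, the merged fit is dominated by the largest studies, whereas the equal-weight ensemble averages over studies more evenly, which is the regime where ensembling is expected to do better.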