It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets and applying standard statistical learning methods can result in poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown $\textit{multi-study ensembling}$ to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multi-study ensembling uses a two-stage $\textit{stacking}$ strategy that fits study-specific models and estimates ensemble weights separately. This approach, however, ignores the ensemble properties at the model-fitting stage, potentially resulting in a loss of efficiency. We therefore propose $\textit{optimal ensemble construction}$, an $\textit{all-in-one}$ approach to multi-study stacking whereby we jointly estimate ensemble weights and the parameters associated with each study-specific model. We prove that limiting cases of our approach yield existing methods such as multi-study stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the proposed loss function. We compare our approach to standard methods by applying it to a multi-country COVID-19 dataset for baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. Importantly, our approach outperforms multi-study stacking and other standard methods in this application. We further characterize the method's performance in data-driven and other simulations. Our method remains competitive with or outperforms multi-study stacking and other earlier methods across a range of between-study heterogeneity levels.
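As a concrete sketch of the distinction (the notation below, with $K$ studies indexed by sets $\mathcal{S}_1, \dots, \mathcal{S}_K$, loss $\ell$, study-specific coefficients $\beta_k$, and nonnegative ensemble weights $w_k$, is assumed for illustration and is not taken from the text above; any regularization or study-specific fidelity terms in the actual loss are omitted), two-stage multi-study stacking first fits each study-specific model and then estimates the ensemble weights with the fitted models held fixed,
\begin{align*}
\hat{\beta}_k &= \arg\min_{\beta} \sum_{i \in \mathcal{S}_k} \ell\big(y_i,\, x_i^\top \beta\big), \qquad k = 1, \dots, K, \\
\hat{w} &= \arg\min_{w \ge 0} \sum_{k=1}^{K} \sum_{i \in \mathcal{S}_k} \ell\Big(y_i,\, \sum_{j=1}^{K} w_j\, x_i^\top \hat{\beta}_j\Big),
\end{align*}
whereas an all-in-one formulation of the kind described here optimizes a single objective jointly over the weights and all study-specific coefficients,
\begin{align*}
\big(\hat{w}, \hat{\beta}_1, \dots, \hat{\beta}_K\big) &= \arg\min_{w \ge 0,\; \beta_1, \dots, \beta_K} \sum_{k=1}^{K} \sum_{i \in \mathcal{S}_k} \ell\Big(y_i,\, \sum_{j=1}^{K} w_j\, x_i^\top \beta_j\Big),
\end{align*}
which a block coordinate descent scheme can address by alternating updates of $w$ and of each $\beta_k$ with the remaining blocks held fixed.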