Integrating data from multiple studies to develop prediction models is an important task in several scientific fields. We propose a framework for generalist and specialist predictions that leverages multiple datasets, with potential differences in the relationships between predictors and outcomes. Our framework uses stacking, and it includes three major components: 1) an ensemble of prediction models trained on one or more datasets, 2) task-specific utility functions, and 3) a no-data-reuse technique for estimating stacking weights. We show that, under mild regularity conditions, the framework produces stacked prediction functions with oracle properties. In particular, we show that the stacking weights are nearly optimal. We also provide sufficient conditions under which the proposed no-data-reuse technique increases prediction accuracy compared to stacking with data reuse. We perform a simulation study to illustrate these results. We apply our framework to predict mortality using a collection of datasets on long-term exposure to air pollutants.
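The three components listed above can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the procedure from the paper: the simulated studies, the helper names (`simulate_study`, `fit_ols`, `stacked_predict`), the half-and-half split, and the choice of negative squared error as the utility are all hypothetical. It trains one linear model per study on half of that study's data, then estimates the stacking weights only on the held-out halves, so that no observation is used both to fit a model and to weight it.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_study(n, beta, noise=0.5):
    """Hypothetical study: linear outcome with study-specific coefficients."""
    X = rng.normal(size=(n, beta.size))
    y = X @ beta + noise * rng.normal(size=n)
    return X, y

def fit_ols(X, y):
    """Least-squares coefficients (no intercept, for brevity)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Three simulated studies whose predictor-outcome relationships differ slightly.
betas = [np.array([1.0, 0.5]), np.array([0.8, 0.7]), np.array([1.2, 0.3])]
studies = [simulate_study(200, b) for b in betas]

# Component 1: an ensemble with one model per study.
# Each model sees only the first half of its own study.
models, holdout = [], []
for X, y in studies:
    half = len(y) // 2
    models.append(fit_ols(X[:half], y[:half]))
    holdout.append((X[half:], y[half:]))

# Component 2: the utility. Here it is negative mean squared error pooled
# over all studies, which is an assumption made for this sketch.
X_ho = np.vstack([X for X, _ in holdout])
y_ho = np.concatenate([y for _, y in holdout])

# Component 3: stacking weights estimated without data reuse.
# Only held-out observations enter the weight estimation.
P = np.column_stack([X_ho @ m for m in models])  # shape (n_holdout, n_models)
weights = fit_ols(P, y_ho)

def stacked_predict(X_new):
    """Weighted combination of the single-study predictions."""
    return np.column_stack([X_new @ m for m in models]) @ weights

mse_stacked = np.mean((stacked_predict(X_ho) - y_ho) ** 2)
mse_single = [np.mean((X_ho @ m - y_ho) ** 2) for m in models]
```

In this sketch the pooled held-out data plays the role of a generalist target. A specialist variant might instead estimate the weights using only the held-out observations from the study of interest; that choice is also an assumption of the example rather than a statement of the paper's method.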