Integrative analysis of data from multiple sources is critical to making generalizable discoveries. Associations that are consistently observed across multiple source populations are more likely to generalize to target populations with possible distributional shifts. In this paper, we model heterogeneous multi-source data with multiple high-dimensional regressions and make inferences for the maximin effect (Meinshausen, Bühlmann, AoS, 43(4), 1801-1830). The maximin effect provides a measure of stable associations across multi-source data. A significant maximin effect indicates that a variable has commonly shared effects across multiple source populations, and these shared effects may generalize to a broader set of target populations. Inference for maximin effects is challenging because the point estimator can have a non-standard limiting distribution. We devise a novel sampling method to construct valid confidence intervals for maximin effects. The proposed confidence interval attains a parametric length. This sampling procedure and the related theoretical analysis are of independent interest for solving other non-standard inference problems. Using genetic data on yeast growth in multiple environments, we demonstrate that genetic variants with significant maximin effects have generalizable effects in new environments.
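For readers unfamiliar with the maximin effect, the following sketch recalls its standard formulation from the cited Meinshausen and Bühlmann reference. The notation here ($L$ groups, group-specific coefficients $b^{(l)}$, a shared covariate covariance $\Sigma$, weights $\gamma$) is an assumption made for illustration and may differ from the notation and generality used in the paper itself.

```latex
% Assumed setup: groups l = 1, ..., L with Y^{(l)} = X^{(l)} b^{(l)} + \epsilon^{(l)}
% and a common positive definite covariance \Sigma of the covariates.
\begin{align*}
  % Variance explained in group l by a candidate coefficient vector \beta
  V_l(\beta) &= \mathbb{E}\big(Y^{(l)}\big)^2 - \mathbb{E}\big(Y^{(l)} - X^{(l)}\beta\big)^2
              = 2\,\beta^{\top}\Sigma\, b^{(l)} - \beta^{\top}\Sigma\,\beta, \\
  % Maximin effect: maximize the worst-case explained variance
  \beta^{*} &= \operatorname*{arg\,max}_{\beta}\ \min_{1 \le l \le L} V_l(\beta), \\
  % Equivalent form: a convex combination of the group coefficients
  \beta^{*} &= \sum_{l=1}^{L} \gamma^{*}_{l}\, b^{(l)}, \qquad
  \gamma^{*} = \operatorname*{arg\,min}_{\gamma \ge 0,\ \sum_{l}\gamma_l = 1} \gamma^{\top}\Gamma\,\gamma, \qquad
  \Gamma_{lk} = \big(b^{(l)}\big)^{\top}\Sigma\, b^{(k)}.
\end{align*}
% Toy example (hypothetical numbers): L = 2, \Sigma = I,
% b^{(1)} = (1, 1)^{\top}, b^{(2)} = (1, -1)^{\top}.
% Then \Gamma = 2I, \gamma^{*} = (1/2, 1/2)^{\top}, and \beta^{*} = (1, 0)^{\top}.
```

In the toy example the first variable has the same effect in both groups and keeps a nonzero maximin effect, while the second variable has opposite signs across groups and its maximin effect is zero. This is the sense in which a significant maximin effect points to an effect shared across sources.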