We consider regression settings in which a response $Y$ is predicted from a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors that have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression, which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of argument for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to characterize stable versus unstable functional dependence on the response graphically. Formally, we introduce the notion of a stable blanket, a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error under the requirement that the resulting regression generalizes to new, unseen environments.
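The distinction between stable and unstable predictors can be illustrated with a small simulation. The sketch below is not the stabilized regression procedure itself; it is a toy example under assumed conditions: a hypothetical linear model $X_1 \to Y \to X_2$ in which the environment changes how strongly $X_2$ depends on $Y$. The coefficient values, sample sizes, and the names `simulate`, `fit_ols`, and `mse` are illustrative choices, not taken from the paper. In this setup, $X_1$ is a stable predictor, while including $X_2$ yields a regression whose form changes with the environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, strength, rng):
    """Draw n samples from the assumed toy model X1 -> Y -> X2.

    `strength` is the environment-specific effect of Y on X2, so the
    dependence between Y and X2 changes from one environment to another.
    """
    x1 = rng.normal(size=n)
    y = 2.0 * x1 + rng.normal(size=n)        # fixed mechanism for Y
    x2 = strength * y + rng.normal(size=n)   # mechanism shifts with environment
    return np.column_stack([x1, x2]), y

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared prediction error of a fitted linear model."""
    pred = coef[0] + X @ coef[1:]
    return float(np.mean((y - pred) ** 2))

# One training environment and one previously unseen test environment.
X_train, y_train = simulate(5000, 0.5, rng)
X_test, y_test = simulate(5000, 2.0, rng)

# Regression on the stable predictor X1 only, versus on both predictors.
coef_stable = fit_ols(X_train[:, :1], y_train)
coef_full = fit_ols(X_train, y_train)

mse_stable_train = mse(coef_stable, X_train[:, :1], y_train)
mse_full_train = mse(coef_full, X_train, y_train)
mse_stable_test = mse(coef_stable, X_test[:, :1], y_test)
mse_full_test = mse(coef_full, X_test, y_test)

# Refit on the test environment to see how much the X1 coefficient moves.
coef_stable_test_env = fit_ols(X_test[:, :1], y_test)
coef_full_test_env = fit_ols(X_test, y_test)

print("train MSE  stable:", mse_stable_train, " full:", mse_full_train)
print("test MSE   stable:", mse_stable_test, " full:", mse_full_test)
print("X1 slope (stable model), train vs test env:",
      coef_stable[1], coef_stable_test_env[1])
print("X1 slope (full model),   train vs test env:",
      coef_full[1], coef_full_test_env[1])
```

In this toy model, the regression on both predictors fits the training environment better, but the regression restricted to $X_1$ predicts better in the shifted environment, and its fitted coefficient stays close to the same value when refit there. This mirrors the trade-off described above between predictive power and generalization to unseen environments.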