Modern machine learning methods and the availability of large-scale data have significantly advanced our ability to predict target quantities from large sets of covariates. However, these methods often struggle under distributional shifts, particularly in the presence of hidden confounding. While the impact of hidden confounding is well studied in causal effect estimation (e.g., with instrumental variables), its implications for prediction tasks under shifting distributions remain underexplored. This work addresses this gap by introducing a strong notion of invariance that, unlike existing weaker notions, allows for distribution generalization even in the presence of nonlinear, non-identifiable structural functions. Central to this framework is the Boosted Control Function (BCF), a novel, identifiable target of inference that satisfies the proposed strong invariance notion and is provably worst-case optimal under distributional shifts. The theoretical foundation of our work lies in Simultaneous Equation Models for Distribution Generalization (SIMDGs), which bridge machine learning and econometrics by describing data-generating processes under distributional shifts. To put these insights into practice, we propose the ControlTwicing algorithm, which estimates the BCF using nonparametric machine-learning techniques, and we compare its generalization performance on synthetic and real-world datasets with that of robust and empirical risk minimization approaches.