We consider a regression setting where observations are collected in different environments modeled by different data distributions. The field of out-of-distribution (OOD) generalization aims to design methods that generalize better to test environments whose distributions differ from those observed during training. One line of work proposes minimizing the maximum risk across environments, a principle that we refer to as MaxRM (Maximum Risk Minimization). In this work, we introduce variants of random forests based on the MaxRM principle. We provide computationally efficient algorithms and prove statistical consistency for our primary method. The proposed method can be used with any of the following three risks: the mean squared error, the negative reward (which relates to the explained variance), and the regret (which quantifies the excess risk relative to the best predictor). For MaxRM with regret as the risk, we prove a novel out-of-sample guarantee over unseen test distributions. Finally, we evaluate the proposed methods on both simulated and real-world data.
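The MaxRM principle can be illustrated with a minimal sketch: among a finite set of candidate predictors, select the one whose worst-case risk (here, mean squared error) over the training environments is smallest. This toy example is an assumption-laden illustration of the min-max objective only, not the paper's random-forest algorithm; the environments and linear candidates are invented for demonstration.

```python
import numpy as np

# Hedged toy sketch of MaxRM: min over candidates f of
# max over environments e of risk R_e(f). Environments and
# candidate predictors below are hypothetical, for illustration.

rng = np.random.default_rng(0)

def make_env(slope, n=200):
    # Each environment has a different X-Y relationship.
    X = rng.normal(size=n)
    Y = slope * X + 0.1 * rng.normal(size=n)
    return X, Y

envs = [make_env(1.0), make_env(0.5)]

# Candidate predictors: simple linear rules f(x) = b * x.
slopes = (0.0, 0.5, 0.75, 1.0)
candidates = [lambda x, b=b: b * x for b in slopes]

def mse(f, X, Y):
    return float(np.mean((f(X) - Y) ** 2))

# MaxRM: minimize the maximum risk across environments.
worst_case = [max(mse(f, X, Y) for X, Y in envs) for f in candidates]
best = int(np.argmin(worst_case))
print("MaxRM choice: slope =", slopes[best])
```

Note that the worst-case criterion selects an intermediate slope rather than the best fit for either single environment, which is the robustness behavior the min-max objective is designed to achieve.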