Prediction models can perform poorly when deployed to target distributions that differ from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), that attributes a drop in performance to different types of distribution shift. Our approach decomposes the performance drop into terms for 1) an increase in harder but frequently seen examples from training, 2) changes in the relationship between features and outcomes, and 3) poor performance on examples that were infrequent or unseen during training. These terms are defined either by fixing a distribution on $X$ while varying the conditional distribution of $Y \mid X$ between training and target, or by fixing the conditional distribution of $Y \mid X$ while varying the distribution on $X$. To do this, we define a hypothetical distribution on $X$ consisting of values common to both training and target, over which it is easy to compare $Y \mid X$ and thus predictive performance. We estimate performance on this hypothetical distribution via reweighting methods. Empirically, we show how our method can 1) inform potential modeling improvements across distribution shifts for employment prediction on tabular census data, and 2) help explain why certain domain adaptation methods fail to improve model performance for satellite image classification.
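The abstract does not spell out the estimator, so the following is only a minimal sketch of how such a decomposition can telescope. Writing $\mu_P(x) = \mathbb{E}_P[\ell \mid X = x]$ and $\mu_Q(x) = \mathbb{E}_Q[\ell \mid X = x]$ for the expected loss under the training and target conditionals, and $S_X$ for the shared distribution, one decomposition consistent with the three terms above is

$$\mathbb{E}_{Q}[\ell]-\mathbb{E}_{P}[\ell] = \big(\mathbb{E}_{S_X}[\mu_P(X)]-\mathbb{E}_{P_X}[\mu_P(X)]\big) + \mathbb{E}_{S_X}[\mu_Q(X)-\mu_P(X)] + \big(\mathbb{E}_{Q_X}[\mu_Q(X)]-\mathbb{E}_{S_X}[\mu_Q(X)]\big).$$

The code assumes a discrete $X$, so that empirical frequencies replace learned density ratios. It also uses a shared density $s(x) \propto p(x)\,q(x)/(p(x)+q(x))$; this choice is our illustrative assumption, not necessarily the paper's construction. The function name `disde_discrete` is hypothetical. With continuous $X$, the weights $s/p$ and $s/q$ would have to be estimated, for instance with a domain classifier.

```python
import numpy as np


def disde_discrete(x_train, loss_train, x_target, loss_target):
    """Split E_Q[loss] - E_P[loss] into three terms for a discrete feature X.

    Assumptions (illustrative, not the paper's estimator):
    - X takes finitely many values, so p(x) and q(x) are empirical frequencies;
    - the shared distribution has density s(x) proportional to
      p(x) q(x) / (p(x) + q(x)), which is zero wherever X is unseen
      in either training or target.
    """
    x_train, x_target = np.asarray(x_train), np.asarray(x_target)
    loss_train = np.asarray(loss_train, dtype=float)
    loss_target = np.asarray(loss_target, dtype=float)

    # Sorted union of observed feature values; every value has p + q > 0.
    values = np.union1d(x_train, x_target)
    p = np.array([np.mean(x_train == v) for v in values])
    q = np.array([np.mean(x_target == v) for v in values])

    # Hypothetical shared distribution concentrated on values seen in both samples.
    s = p * q / (p + q)
    if s.sum() == 0:
        raise ValueError("training and target share no feature values")
    s = s / s.sum()

    # Position of each sample in `values` (exact, since `values` is sorted).
    i_train = np.searchsorted(values, x_train)
    i_target = np.searchsorted(values, x_target)

    # Reweighting: E_S[mu_P(X)] = E_P[(s/p)(X) * loss] and
    #              E_S[mu_Q(X)] = E_Q[(s/q)(X) * loss].
    shared_train = np.mean(s[i_train] / p[i_train] * loss_train)
    shared_target = np.mean(s[i_target] / q[i_target] * loss_target)

    train_loss, target_loss = loss_train.mean(), loss_target.mean()
    return {
        "total": target_loss - train_loss,
        # 1) X shift toward values seen in both, holding Y | X at training
        "x_shift_shared": shared_train - train_loss,
        # 2) change in Y | X, holding X at the shared distribution
        "y_given_x_shift": shared_target - shared_train,
        # 3) X shift from the shared distribution to the full target
        #    (values rare or unseen in training)
        "x_shift_unshared": target_loss - shared_target,
    }


# Toy data (made up for illustration): value 3 never appears in training.
terms = disde_discrete(
    x_train=[0, 0, 0, 1, 1, 2], loss_train=[0.1, 0.1, 0.1, 0.5, 0.5, 0.2],
    x_target=[1, 1, 1, 2, 3, 3], loss_target=[0.6, 0.6, 0.6, 0.3, 0.9, 0.9],
)
print(terms)
```

The three terms telescope, so they sum to `terms["total"]` by construction. All of the toy target's mass on the unseen value 3 lands in the third term, because $s(3) = 0$.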