Diagnosing Model Performance Under Distribution Shift

Prediction models can perform poorly when deployed to target distributions different from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), to attribute a drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into terms for 1) an increase in harder but frequently seen examples from training, 2) changes in the relationship between features and outcomes, and 3) poor performance on examples infrequent or unseen during training. These terms are defined by fixing a distribution on $X$ while varying the conditional distribution of $Y \mid X$ between training and target, or by fixing the conditional distribution of $Y \mid X$ while varying the distribution on $X$. In order to do this, we define a hypothetical distribution on $X$ consisting of values common in both training and target, over which it is easy to compare $Y \mid X$ and thus predictive performance. We estimate performance on this hypothetical distribution via reweighting methods. Empirically, we show how our method can 1) inform potential modeling improvements across distribution shifts for employment prediction on tabular census data, and 2) help to explain why certain domain adaptation methods fail to improve model performance for satellite image classification.
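The three-term decomposition described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes, beyond what the abstract states, that the hypothetical shared distribution on $X$ has density proportional to $p(x)q(x)/(p(x)+q(x))$, where $p$ and $q$ are the training and target densities. Under that assumption, and with equally sized samples, the importance weights reduce to the output of a domain classifier $\pi(x) = \Pr(\text{target} \mid x)$: proportional to $\pi(x)$ on training points and to $1 - \pi(x)$ on target points. The function and argument names (`decompose_shift`, `pi_train`, `pi_target`) are hypothetical.

```python
import numpy as np


def weighted_mean(values, weights):
    """Self-normalized weighted average of per-example losses."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * values) / np.sum(weights))


def decompose_shift(loss_train, loss_target, pi_train, pi_target):
    """Split the train-to-target performance drop into three terms.

    loss_train, loss_target: per-example losses of one fixed model.
    pi_train, pi_target: domain-classifier probabilities P(target | x)
        evaluated on training and target points (assumed inputs; fitting
        the classifier is outside this sketch).
    """
    loss_train = np.asarray(loss_train, dtype=float)
    loss_target = np.asarray(loss_target, dtype=float)
    pi_train = np.asarray(pi_train, dtype=float)
    pi_target = np.asarray(pi_target, dtype=float)

    perf_train = float(loss_train.mean())    # training X, training Y|X
    perf_target = float(loss_target.mean())  # target X, target Y|X

    # Assumed shared distribution: weight is pi(x) on training points
    # and 1 - pi(x) on target points (both weights stay bounded).
    shared_train = weighted_mean(loss_train, pi_train)          # shared X, training Y|X
    shared_target = weighted_mean(loss_target, 1.0 - pi_target)  # shared X, target Y|X

    return {
        # 1) more weight on harder examples that are common in training
        "x_shift_to_shared": shared_train - perf_train,
        # 2) change in Y|X, compared on the shared X distribution
        "y_given_x_shift": shared_target - shared_train,
        # 3) examples infrequent or unseen during training
        "x_shift_from_shared": perf_target - shared_target,
        "total": perf_target - perf_train,
    }


if __name__ == "__main__":
    # Synthetic numbers for illustration only.
    result = decompose_shift(
        loss_train=[0.1, 0.2, 0.3, 0.4],
        loss_target=[0.3, 0.5, 0.6, 0.9],
        pi_train=[0.2, 0.4, 0.5, 0.7],
        pi_target=[0.4, 0.5, 0.7, 0.9],
    )
    print(result)
```

The three terms telescope, so they always sum to the total performance drop. Which term dominates suggests where to intervene: a large second term points to a changed feature-outcome relationship, while a large third term points to target examples that training data barely covers.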
