The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as those in which we want to deploy it. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; in such settings, empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily, but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor $\Gamma$. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees, via the method of sieves, for learning a model that is robust to sampling bias, and we propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate the proposed method in simulations and in a case study on ICU length-of-stay prediction.
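To make the Rockafellar–Uryasev reformulation concrete, below is a minimal, hypothetical PyTorch sketch of a CVaR-style augmented loss of the kind the abstract describes: a predictor $f$ and an auxiliary network $h$ are trained jointly, with $h(x)$ playing the role of the Rockafellar–Uryasev threshold. The likelihood-ratio bounds used here ($a = 1/\Gamma$, $b = \Gamma$) are illustrative assumptions, as are all function and variable names; the paper's exact constants and normalization may differ.

```python
# Hypothetical sketch of a Rockafellar-Uryasev-style robust loss under
# Gamma-biased sampling. The bounds a = 1/Gamma, b = Gamma are illustrative
# likelihood-ratio bounds, not the paper's exact constants.
import torch
import torch.nn as nn


def ru_robust_loss(loss_vec, h_x, gamma):
    """Augmented convex surrogate for the worst-case risk.

    For per-example losses ell and an auxiliary threshold h(x), the
    Rockafellar-Uryasev duality gives, pointwise in X:
        sup over weights w in [a, b] with E[w | X] = 1 of E[w * ell | X]
      = min over h of  a * ell + (1 - a) * h + (b - a) * relu(ell - h).
    For fixed f this objective is convex in h(x), which is what makes the
    joint problem an augmented convex risk minimization.
    """
    a, b = 1.0 / gamma, gamma  # illustrative bounds implied by Gamma
    return (a * loss_vec + (1.0 - a) * h_x
            + (b - a) * torch.relu(loss_vec - h_x)).mean()


# Joint training of the predictor f and the auxiliary network h.
f = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
h = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(f.parameters()) + list(h.parameters()), lr=1e-3)

x, y = torch.randn(256, 10), torch.randn(256, 1)  # toy data
for _ in range(100):
    opt.zero_grad()
    per_example = ((f(x) - y) ** 2).squeeze(-1)   # squared error, per example
    loss = ru_robust_loss(per_example, h(x).squeeze(-1), gamma=2.0)
    loss.backward()
    opt.step()
```

Minimizing over $f$ and $h$ simultaneously, rather than solving an inner maximization over test distributions, is the point of the reformulation: the worst-case adversary is absorbed into the extra hinge term, so standard stochastic gradient training applies.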