Algorithmic Bias and Data Bias: Understanding the Relation between Distributionally Robust Optimization and Data Curation

Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, an inconsistency that a low average error on the entire dataset does not expose. In consequential social and economic applications, where data represent people, this can lead to discrimination against underrepresented gender and ethnic groups. Given the importance of bias mitigation in machine learning, the topic leads to contentious debates on how to ensure fairness in practice (data bias versus algorithmic bias). Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover both finite and infinite numbers of training distributions, as well as convex and non-convex loss functions. We show that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation: in the same way that there is no universally robust training set, there is no universal way to set up a DRO problem and ensure a socially acceptable set of results. We then leverage these insights to provide a minimal set of practical recommendations for addressing bias with DRO. Finally, we discuss ramifications of our results in other related applications of DRO, using an example of adversarial robustness. Our results show that there is merit to both the algorithm-focused and the data-focused side of the bias debate, as long as arguments in favor of these positions are precisely qualified and backed by relevant mathematics known today.
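As a rough illustration of the two objectives contrasted above, the following minimal sketch compares average-risk minimization, worst-group (DRO-style) minimization, and minimization of a reweighted average on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the two synthetic groups, the squared loss with a single constant prediction `theta`, the grid search, and all function names. In this symmetric convex toy case the worst-group minimizer happens to coincide with the equal-weight minimizer; in general the matching weights depend on the problem and need not be known in advance.

```python
# Toy comparison of average-risk, worst-group, and reweighted objectives.
# All data, names, and the grid search are hypothetical illustrations.

def group_risk(theta, xs):
    """Mean squared error of the constant prediction theta on one group."""
    return sum((x - theta) ** 2 for x in xs) / len(xs)

def weighted_risk(theta, groups, weights):
    """Weighted average of the per-group risks."""
    return sum(w * group_risk(theta, xs) for w, xs in zip(weights, groups))

def worst_group_risk(theta, groups):
    """DRO-style objective over a finite set of groups: the largest group risk."""
    return max(group_risk(theta, xs) for xs in groups)

def argmin_on_grid(objective, grid):
    """Brute-force minimizer over a finite grid of candidate parameters."""
    return min(grid, key=objective)

# Majority group centered at 0 (90 samples), minority group centered at 1 (10 samples).
group_a = [-0.5, 0.5] * 45
group_b = [0.5, 1.5] * 5
groups = [group_a, group_b]
grid = [i / 1000 for i in range(-1000, 2001)]

# Average-risk minimization on the pooled data weights groups by their size.
n_total = sum(len(xs) for xs in groups)
pooled_weights = [len(xs) / n_total for xs in groups]

theta_erm = argmin_on_grid(lambda t: weighted_risk(t, groups, pooled_weights), grid)
theta_dro = argmin_on_grid(lambda t: worst_group_risk(t, groups), grid)
theta_reweighted = argmin_on_grid(lambda t: weighted_risk(t, groups, [0.5, 0.5]), grid)

print("pooled-average minimizer:", theta_erm)
print("worst-group minimizer:   ", theta_dro)
print("equal-weight minimizer:  ", theta_reweighted)
print("worst-group risk at pooled minimizer:", worst_group_risk(theta_erm, groups))
print("worst-group risk at DRO minimizer:   ", worst_group_risk(theta_dro, groups))
```

The pooled minimizer sits close to the majority group and leaves the minority group with a much higher risk, while the worst-group minimizer balances the two; choosing the groups and the weights remains a modeling decision in either formulation.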
