While modern large-scale datasets often consist of heterogeneous subpopulations -- for example, multiple demographic groups or multiple text corpora -- the standard practice of minimizing average loss fails to guarantee uniformly low losses across all subpopulations. We propose a convex procedure that controls the worst-case performance over all subpopulations of a given size. Our procedure comes with finite-sample (nonparametric) convergence guarantees on the loss of the worst-off subpopulation. Empirically, we observe on lexical similarity, wine quality, and recidivism prediction tasks that our worst-case procedure learns models that perform well on unseen subpopulations.
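The worst-case average loss over all subpopulations of a given relative size can be computed from per-example losses alone: it is the mean loss over the hardest fraction of examples (the conditional value-at-risk of the loss distribution). A minimal empirical sketch of this quantity follows; the function name `worst_subpopulation_loss` is illustrative, and this is not the paper's exact optimization procedure, only the objective it seeks to control:

```python
import numpy as np

def worst_subpopulation_loss(losses, alpha):
    """Average loss over the worst-off alpha-fraction of examples.

    Equivalently, the worst-case average loss over any subpopulation
    of relative size alpha (a CVaR-style quantity; illustrative only).
    """
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, int(np.ceil(alpha * losses.size)))  # subpopulation size
    return losses[-k:].mean()  # mean over the k hardest examples

per_example = np.array([0.1, 0.2, 0.3, 2.0])
avg = per_example.mean()                                # 0.65
worst = worst_subpopulation_loss(per_example, alpha=0.25)  # 2.0
```

Minimizing the average `avg` can leave a small subpopulation with large loss (`worst`), which is the gap the proposed procedure is designed to close.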