In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data, an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.
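As a rough illustration of the budget-allocation framing described above, the following Python sketch splits a fixed budget across subgroups with different per-sample costs. It is not the method of the paper. It assumes that each group's error follows a power-law learning curve with known parameters, that the objective is a weighted sum of group errors, and that a greedy "best error reduction per unit cost" rule is an acceptable heuristic. All names (`predicted_error`, `greedy_allocation`, `scales`, `rates`, `weights`) are hypothetical. The sequential setting in which the learning rates are unknown and must be estimated is not covered here.

```python
from typing import List


def predicted_error(n: int, scale: float, rate: float) -> float:
    """Assumed learning curve: error = scale * (n + 1) ** (-rate).

    The +1 offset only keeps the curve finite at n = 0; it is an
    illustrative choice, not something taken from the source.
    """
    return scale * (n + 1) ** (-rate)


def greedy_allocation(
    budget: float,
    costs: List[float],
    scales: List[float],
    rates: List[float],
    weights: List[float],
) -> List[int]:
    """Heuristically choose per-group sample counts under a budget.

    At each step, buy one sample from the group whose next sample gives
    the largest weighted error reduction per unit cost, until no group
    is affordable. This is a heuristic, not a guaranteed optimum when
    costs differ across groups.
    """
    counts = [0] * len(costs)
    remaining = budget
    while True:
        best_group, best_gain = None, 0.0
        for g, cost in enumerate(costs):
            if cost > remaining:
                continue  # cannot afford another sample from this group
            drop = (
                predicted_error(counts[g], scales[g], rates[g])
                - predicted_error(counts[g] + 1, scales[g], rates[g])
            )
            gain = weights[g] * drop / cost
            if gain > best_gain:
                best_group, best_gain = g, gain
        if best_group is None:
            break  # budget exhausted or no sample reduces the objective
        counts[best_group] += 1
        remaining -= costs[best_group]
    return counts


if __name__ == "__main__":
    # Two hypothetical groups; the second is twice as costly to sample.
    allocation = greedy_allocation(
        budget=100,
        costs=[1.0, 2.0],
        scales=[1.0, 1.0],
        rates=[0.5, 0.5],
        weights=[1.0, 1.0],
    )
    print(allocation)
```

The `weights` argument is where stakeholder preferences would enter in this sketch: raising a group's weight shifts more of the budget toward that group, while a higher per-sample cost shifts budget away from it.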