We introduce dataset multiplicity, a way to study how inaccuracies, uncertainty, and social bias in training datasets impact test-time predictions. The dataset multiplicity framework asks a counterfactual question: what would the set of resultant models (and their associated test-time predictions) be if we could somehow access every hypothetical, unbiased version of the dataset? We discuss how to use this framework to capture various sources of uncertainty about a dataset's factualness, including systemic social bias, data collection practices, and noisy labels or features. We show how to exactly analyze the impact of dataset multiplicity for a specific model architecture and type of uncertainty: linear models with label errors. Our empirical analysis shows that real-world datasets, under reasonable assumptions, contain many test samples whose predictions are affected by dataset multiplicity. Furthermore, the choice of domain-specific dataset multiplicity definition determines which samples are affected, and whether different demographic groups are disparately impacted. Finally, we discuss implications of dataset multiplicity for machine learning practice and research, including considerations for when model outcomes should not be trusted.
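To make the counterfactual question concrete, the sketch below illustrates one simple instantiation of the idea for linear models with label errors; it is our minimal example, not the paper's exact algorithm. It assumes binary labels in {-1, +1}, a hypothetical error budget `max_flips` bounding how many training labels may be wrong, and declares a test prediction "affected by dataset multiplicity" if any plausible relabeling changes the prediction's sign.

```python
# A minimal, illustrative sketch (not the paper's exact method): brute-force
# dataset multiplicity for least-squares linear regression under label errors.
# Assumptions (ours): labels in {-1, +1}, at most `max_flips` labels may be
# wrong, and a prediction is "affected" if any relabeling flips its sign.
from itertools import combinations
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with a bias column; returns the weight vector."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def prediction_is_robust(X, y, x_test, max_flips):
    """Check whether x_test's predicted sign is identical across every
    hypothetical dataset obtained by flipping at most `max_flips` labels."""
    xb = np.append(x_test, 1.0)
    base_sign = np.sign(xb @ fit_linear(X, y))
    for k in range(1, max_flips + 1):
        for idx in combinations(range(len(y)), k):
            y_alt = y.copy()
            y_alt[list(idx)] *= -1  # hypothesize these labels were wrong
            if np.sign(xb @ fit_linear(X, y_alt)) != base_sign:
                return False  # some plausible dataset yields a different prediction
    return True

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))
print(prediction_is_robust(X, y, x_test=np.array([0.05, -0.02]), max_flips=1))
```

This brute-force enumeration is exponential in the error budget; the exact analysis in the paper avoids such enumeration, but the sketch conveys what it means for a test-time prediction to depend on which hypothetical, corrected dataset the model was trained on.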