Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to increasing subgroup performance, but also to achieving population-level objectives. Our analysis and experiments describe how dataset composition influences performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.