Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time-intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real-world variability. We present designing data, an iterative, bias-mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model through Out-of-Distribution (OOD) methods. We instantiate designing data through our own data collection and applied ML case study. We find that models trained on "designed" datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.