Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time-intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real-world variability. We present designing data, an iterative, bias-mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model through Out-of-Distribution (OOD) methods. We instantiate designing data through our own data collection and applied ML case study. We find that models trained on "designed" datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.