Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of how exactly cleaning affects ML -- the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning in isolation, without considering how the data is consumed by downstream ML analytics. We propose the CleanML study, which systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in the ML experiments using statistical hypothesis testing, and we control the false discovery rate across experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations, and we also put forward multiple research directions for future work.
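As a rough illustration of the testing setup described above (not taken from the CleanML codebase), one could compare model accuracy on dirty versus cleaned data with a paired hypothesis test per experimental combination and then apply the Benjamini-Yekutieli correction to the resulting p-values; the dataset counts, accuracy values, and significance level below are placeholders.

```python
# Minimal sketch, assuming paired t-tests over repeated train/test splits and
# BY-corrected false discovery rate control; all numbers here are synthetic.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# One p-value per hypothetical (dataset, error type, model, cleaning method)
# combination, comparing accuracies before and after cleaning.
p_values = []
for _ in range(20):  # 20 placeholder experimental combinations
    acc_dirty = rng.normal(0.80, 0.02, size=20)              # accuracy on dirty data
    acc_clean = acc_dirty + rng.normal(0.01, 0.02, size=20)  # accuracy after cleaning
    _, p = ttest_rel(acc_clean, acc_dirty)
    p_values.append(p)

# Benjamini-Yekutieli procedure: controls the false discovery rate even under
# arbitrary dependence among the tests.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")
print(f"{reject.sum()} of {len(p_values)} comparisons remain significant after BY correction")
```

The BY procedure is chosen over the more common Benjamini-Hochberg correction when the individual tests may be arbitrarily dependent, which is plausible here since the same datasets and models are reused across comparisons.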