Machine learning (ML) now plays a vital role in many aspects of daily life. Building well-performing ML applications requires high-quality data throughout the entire life-cycle of such applications. Nevertheless, most real-world tabular data suffer from different types of discrepancies, such as missing values, outliers, duplicates, pattern violations, and inconsistencies. Such discrepancies typically emerge while collecting, transferring, storing, and/or integrating the data. To deal with these discrepancies, numerous data cleaning methods have been introduced. However, the majority of these methods overlook the requirements imposed by downstream ML models. As a result, the potential of these data cleaning methods in ML pipelines remains largely unexplored. In this work, we introduce a comprehensive benchmark, called REIN, to thoroughly investigate the impact of data cleaning methods on various ML models. Through the benchmark, we provide answers to important research questions, e.g., where and whether data cleaning is a necessary step in ML pipelines. To this end, the benchmark examines 38 simple and advanced error detection and repair methods. To evaluate these methods, we use a wide collection of ML models trained on 14 publicly available datasets covering different domains and encompassing realistic as well as synthetic error profiles.