Dataset scaling, also known as normalization, is an essential preprocessing step in machine learning pipelines. It aims to adjust attribute scales so that all attributes vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and the choice is often made without much care. In this paper, we conduct a broad experiment comparing the impact of 5 scaling techniques on the performance of 20 classification algorithms, spanning monolithic and ensemble models, applied to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance: the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show that the performance variation of an ensemble model across scaling techniques tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance, and provide insights into its applicability in different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository.\footnote{https://github.com/amorimlb/scaling\_matters}
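To make the comparison concrete, the sketch below measures how different scalers affect a scale-sensitive classifier. Since the abstract does not name the five techniques, the models, or the datasets used in the study, the scikit-learn scalers, the wine dataset, and the k-NN classifier here are illustrative assumptions, not the paper's experimental protocol.

\begin{verbatim}
# Minimal sketch: comparing scaling techniques with scikit-learn.
# The five scalers, the dataset, and the classifier are illustrative
# assumptions -- the abstract does not name the techniques evaluated.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   Normalizer, RobustScaler,
                                   StandardScaler)

X, y = load_wine(return_X_y=True)

scalers = {
    "no scaling": None,
    "min-max": MinMaxScaler(),
    "standardization": StandardScaler(),
    "max-abs": MaxAbsScaler(),
    "robust": RobustScaler(),
    "unit norm": Normalizer(),
}

for name, scaler in scalers.items():
    # Fit the scaler inside the pipeline so cross-validation does not
    # leak test-fold statistics into the scaling step.
    steps = ([scaler] if scaler is not None else []) + [KNeighborsClassifier()]
    scores = cross_val_score(make_pipeline(*steps), X, y, cv=5)
    print(f"{name:>15}: {scores.mean():.3f} +/- {scores.std():.3f}")
\end{verbatim}

On a dataset whose attributes span very different ranges, such a run typically shows a clear gap between the scaled and unscaled pipelines, and a smaller but nonzero gap among the scalers themselves, which is the kind of variation the paper quantifies at scale.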