Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., the median). However, such discretization may introduce noise (i.e., discretization noise) because data points close to the artificial threshold have ambiguous class loyalty. Previous studies offer no clear guidance on how discretization noise affects classifiers or on how to handle it. In this paper, we propose a framework that helps researchers and practitioners systematically estimate the impact of discretization noise on classifiers, both on various performance measures and on the interpretation of the classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently across datasets; 2) although the interpretation of the classifiers is affected by discretization noise overall, the top 3 most important features are not. We therefore suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of the classifiers they build, and to estimate the exact amount of discretization noise to discard from the dataset to avoid its negative impact.
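As a rough illustration of the setting described above (not the framework proposed in the paper), the following sketch discretizes a continuous variable at its median and sets aside the points closest to that threshold as potentially noisy. The function name `discretize_with_margin` and the `margin_fraction` parameter are hypothetical, introduced only for this example; how much data to discard is exactly what the framework is meant to estimate.

```python
from statistics import median


def discretize_with_margin(values, margin_fraction=0.1):
    """Label values 0/1 around the median and drop the points nearest to it.

    margin_fraction is an assumed knob: the share of points, ranked by
    distance to the threshold, that are treated as ambiguous and discarded.
    """
    threshold = median(values)

    # Binary labels: 1 if strictly above the threshold, otherwise 0.
    labels = [1 if v > threshold else 0 for v in values]

    # Rank indices by distance to the threshold; the closest are ambiguous.
    n_ambiguous = int(len(values) * margin_fraction)
    by_distance = sorted(range(len(values)),
                         key=lambda i: abs(values[i] - threshold))
    ambiguous = set(by_distance[:n_ambiguous])

    # Keep only (value, label) pairs that lie outside the ambiguous band.
    kept = [(values[i], labels[i])
            for i in range(len(values)) if i not in ambiguous]
    return threshold, labels, kept


threshold, labels, kept = discretize_with_margin(list(range(1, 11)), 0.2)
```

Training a classifier once on all labelled points and once on `kept`, for several values of `margin_fraction`, is one simple way to observe how sensitive a given performance measure is to points near the threshold.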