In many classification models, data are discretized to better estimate their distribution. Existing discretization methods often aim to maximize the discriminant power of the discretized data, overlooking the fact that the primary goal of discretization in classification is to improve generalization performance. As a result, the data tend to be over-split into many small bins, since undiscretized data retain the maximal discriminant information. We therefore propose a Max-Dependency-Min-Divergence (MDmD) criterion that maximizes both the discriminant information and the generalization ability of the discretized data. More specifically, the Max-Dependency criterion maximizes the statistical dependency between the discretized data and the classification variable, while the Min-Divergence criterion explicitly minimizes the Jensen-Shannon (JS) divergence between the training data and the validation data for a given discretization scheme. The proposed MDmD criterion is technically appealing, but it requires reliable estimates of the high-order joint distributions of the attributes and the classification variable, which are difficult to obtain. We hence further propose a more practical solution, the Max-Relevance-Min-Divergence (MRmD) discretization scheme, in which each attribute is discretized separately by simultaneously maximizing the discriminant information and the generalization ability of the discretized data. The proposed MRmD is compared with state-of-the-art discretization algorithms under the naive Bayes classification framework on 45 machine-learning benchmark datasets. It significantly outperforms all the compared methods on most of the datasets.
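To make the per-attribute idea concrete, the following is a minimal sketch of how a single attribute's discretization might be scored. It is not the authors' implementation: the abstract does not state how the two terms are combined, so the difference form with a weight `lam`, the use of the joint (bin, class) distribution in the divergence term, and all function names (`joint_distribution`, `mutual_information`, `js_divergence`, `mrmd_score`) are assumptions made for illustration. Class labels are assumed to be integers starting at 0.

```python
import numpy as np

def joint_distribution(bins, labels, n_bins, n_classes):
    """Empirical joint distribution P(bin, class) as an (n_bins, n_classes) array."""
    counts = np.zeros((n_bins, n_classes))
    np.add.at(counts, (bins, labels), 1.0)  # accumulate one count per sample
    return counts / counts.sum()

def mutual_information(joint):
    """I(bin; class) in bits: the relevance term."""
    p_bin = joint.sum(axis=1, keepdims=True)
    p_cls = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # skip empty cells, whose contribution is 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_bin @ p_cls)[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions of equal shape."""
    p, q = p.ravel(), q.ravel()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def mrmd_score(x_train, y_train, x_val, y_val, cut_points, lam=1.0):
    """Hypothetical score: relevance on training data minus lam times the
    train/validation JS divergence (the combination rule is an assumption)."""
    cuts = np.sort(np.asarray(cut_points, dtype=float))
    n_bins = len(cuts) + 1
    n_classes = int(max(y_train.max(), y_val.max())) + 1
    joint_train = joint_distribution(np.digitize(x_train, cuts), y_train, n_bins, n_classes)
    joint_val = joint_distribution(np.digitize(x_val, cuts), y_val, n_bins, n_classes)
    return mutual_information(joint_train) - lam * js_divergence(joint_train, joint_val)

# Toy usage: one cut at 0.5 separates the two classes in both splits.
x_tr, y_tr = np.array([0.1, 0.2, 0.8, 0.9]), np.array([0, 0, 1, 1])
x_va, y_va = np.array([0.15, 0.3, 0.7, 0.95]), np.array([0, 0, 1, 1])
print(mrmd_score(x_tr, y_tr, x_va, y_va, cut_points=[0.5]))
```

Under this scoring, adding cut points can only help while the gain in relevance outweighs the growth in train/validation divergence, which is the mechanism the abstract describes for avoiding over-splitting into many small bins.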
Title: A Max-Relevance-Min-Divergence Discretization Criterion and Its Application to Naive Bayes