Many improved naive Bayes (NB) methods with enhanced discrimination capabilities have been developed recently. Among them, regularized naive Bayes (RNB) achieves excellent performance by balancing discrimination power and generalization capability. Data discretization is important in naive Bayes: by grouping similar values into one interval, the data distribution can be better estimated. However, existing methods, including RNB, often discretize the data into too few intervals, which may result in significant information loss. To address this problem, we propose a semi-supervised adaptive discriminative discretization framework for naive Bayes, which better estimates the data distribution by using both labeled and unlabeled data through pseudo-labeling. The proposed method also significantly reduces the information loss during discretization through an adaptive discriminative discretization scheme, and hence greatly improves the discrimination power of classifiers. The proposed RNB+, i.e., regularized naive Bayes with the proposed discretization framework, is systematically evaluated on a wide range of machine-learning datasets, where it significantly and consistently outperforms state-of-the-art NB classifiers.
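To make the two ingredients named in the abstract concrete (discretization before naive Bayes, and pseudo-labeling of unlabeled data), the following is a minimal sketch under stated assumptions. It is not the proposed framework: equal-width binning stands in for the adaptive discriminative discretization scheme, a plain Laplace-smoothed naive Bayes stands in for RNB, and all names (`equal_width_bins`, `DiscreteNB`, `pseudo_label_fit`) and the confidence threshold are hypothetical choices made for illustration only.

```python
import math
from collections import Counter, defaultdict


def equal_width_bins(values, n_bins):
    """Return a function mapping a raw value to a bin index in [0, n_bins - 1].

    Assumption: equal-width binning is only a placeholder for the
    adaptive discriminative discretization described in the abstract.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant features

    def to_bin(v):
        return min(max(int((v - lo) / width), 0), n_bins - 1)

    return to_bin


class DiscreteNB:
    """Plain naive Bayes over bin indices with Laplace smoothing (not RNB)."""

    def __init__(self, n_bins, alpha=1.0):
        self.n_bins = n_bins
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_count = Counter(y)
        self.n = len(y)
        # counts[j][c][b]: how often feature j falls in bin b for class c
        self.counts = [defaultdict(Counter) for _ in range(len(X[0]))]
        for row, c in zip(X, y):
            for j, b in enumerate(row):
                self.counts[j][c][b] += 1
        return self

    def predict_proba(self, row):
        scores = {}
        for c in self.classes:
            s = math.log(self.class_count[c] / self.n)
            for j, b in enumerate(row):
                num = self.counts[j][c][b] + self.alpha
                den = self.class_count[c] + self.alpha * self.n_bins
                s += math.log(num / den)
            scores[c] = s
        m = max(scores.values())  # subtract the max for numerical stability
        exp = {c: math.exp(s - m) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}

    def predict(self, row):
        proba = self.predict_proba(row)
        return max(proba, key=proba.get)


def pseudo_label_fit(X_lab, y_lab, X_unlab, n_bins=4, threshold=0.9):
    """One round of pseudo-labeling on top of a discretized naive Bayes.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    # Bin boundaries are estimated from labeled and unlabeled values together.
    all_rows = X_lab + X_unlab
    binners = [equal_width_bins([r[j] for r in all_rows], n_bins)
               for j in range(len(X_lab[0]))]

    def disc(rows):
        return [[binners[j](v) for j, v in enumerate(r)] for r in rows]

    D_lab, D_unlab = disc(X_lab), disc(X_unlab)
    model = DiscreteNB(n_bins).fit(D_lab, y_lab)

    # Keep only unlabeled rows whose predicted class is confident enough.
    extra_X, extra_y = [], []
    for row in D_unlab:
        proba = model.predict_proba(row)
        c = max(proba, key=proba.get)
        if proba[c] >= threshold:
            extra_X.append(row)
            extra_y.append(c)

    # Refit on labeled plus pseudo-labeled rows.
    model = DiscreteNB(n_bins).fit(D_lab + extra_X, y_lab + extra_y)
    return model, disc, len(extra_y)


# Toy data: two well-separated classes, two features.
X_lab = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 3.0], [3.1, 3.2]]
y_lab = [0, 0, 0, 1, 1, 1]
X_unlab = [[0.05, 0.15], [0.15, 0.05], [3.05, 3.15], [3.15, 3.05]]

model, disc, n_pseudo = pseudo_label_fit(X_lab, y_lab, X_unlab)
print(n_pseudo, model.predict(disc([[0.1, 0.1]])[0]))
```

The sketch fixes the number of bins in advance, which is exactly the kind of coarse choice the abstract argues can lose information; an adaptive scheme would instead choose the intervals per feature from the data.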