Labeling mistakes are common in real-world applications. If not handled properly, they can seriously degrade a model's classification performance. To address this issue, we propose an improved Naive Bayes method for text classification. It is analytically simple and requires no subjective judgement about which labels are correct and which are incorrect. By specifying the generating mechanism of incorrect labels, we iteratively optimize the corresponding log-likelihood function using an EM algorithm. Our simulation and experimental results show that the improved method substantially improves on the performance of the standard Naive Bayes method when the training data contain mislabeled examples.
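To make the idea of fitting Naive Bayes by EM under a label-noise model concrete, the following is a minimal sketch and not the method of the paper itself. The abstract does not state the generating mechanism of incorrect labels, so the sketch assumes a class-conditional flip matrix `T`, where `T[k, j]` is the probability that a document whose true class is `k` is observed with label `j`, together with a multinomial Naive Bayes model over word counts. The function name `fit_noisy_label_nb` and the parameters `alpha` and `init_flip` are hypothetical.

```python
import numpy as np

def fit_noisy_label_nb(X, y, n_classes, n_iter=50, alpha=1.0, init_flip=0.2):
    """EM for multinomial Naive Bayes with a latent true label (illustrative sketch).

    X: (n_docs, n_words) word-count matrix.
    y: (n_docs,) observed, possibly incorrect, integer labels.
    Assumed noise model: P(observed = j | true = k) = T[k, j].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Y = np.eye(n_classes)[y]  # one-hot observed labels, shape (n_docs, K)

    # Initial responsibilities: mostly trust the observed label.
    R = (1.0 - init_flip) * Y + init_flip / n_classes

    for _ in range(n_iter):
        # M-step: re-estimate parameters from the soft assignments R.
        pi = R.mean(axis=0)                          # class prior
        counts = R.T @ X + alpha                     # smoothed word counts per class
        theta = counts / counts.sum(axis=1, keepdims=True)
        T = R.T @ Y + 1e-9                           # expected (true, observed) counts
        T /= T.sum(axis=1, keepdims=True)

        # E-step: posterior over the true label given words and observed label.
        log_post = (
            np.log(pi + 1e-12)
            + X @ np.log(theta).T
            + np.log(T[:, y]).T                      # entry [i, k] is log T[k, y_i]
        )
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        R = np.exp(log_post)
        R /= R.sum(axis=1, keepdims=True)

    return pi, theta, T, R

def predict(X, pi, theta):
    """Classify new documents with the fitted prior and word distributions."""
    scores = np.log(pi + 1e-12) + np.asarray(X, dtype=float) @ np.log(theta).T
    return scores.argmax(axis=1)
```

Under these assumptions, `R` holds the estimated posterior over each training document's true class, and `T` is the estimated flip matrix. Initializing from the observed labels presumes that most labels are correct; if more than half of a class were mislabeled, this sketch could converge to swapped classes.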