The vast majority of statistical theory on binary classification characterizes performance in terms of accuracy. However, accuracy is known in many cases to poorly reflect the practical consequences of classification error, most famously in imbalanced binary classification, where data are dominated by samples from one of the two classes. The first part of this paper derives a novel generalization of the Bayes-optimal classifier from accuracy to any performance metric computed from the confusion matrix. Specifically, this result (a) demonstrates that stochastic classifiers sometimes outperform the best possible deterministic classifier and (b) removes an empirically unverifiable absolute continuity assumption that is poorly understood but pervades existing results. We then demonstrate how to use this generalized Bayes classifier to obtain regret bounds in terms of the error of estimating regression functions under uniform loss. Finally, we use these results to develop some of the first finite-sample statistical guarantees specific to imbalanced binary classification. In particular, we demonstrate that optimal classification performance depends on properties of class imbalance, such as a novel notion we call Uniform Class Imbalance, that have not previously been formalized. We further illustrate these contributions numerically in the case of $k$-nearest neighbor classification.
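To give a sense of claim (a), here is a minimal sketch (not an example from the paper itself): when the feature distribution has an atom, randomizing the prediction on that atom can strictly beat every deterministic classifier under a metric that is a nonlinear function of the population confusion matrix. The metric $G(C) = \mathrm{TP} - 2\,\mathrm{FP}^2$ and the single-atom setup with $P(Y=1 \mid X) = 0.5$ are hypothetical choices made purely for illustration.

```python
# Toy illustration (assumed setup, not from the paper): X is a single atom,
# P(Y = 1 | X) = 0.5, and the metric G(C) = TP - 2*FP^2 is an arbitrary
# nonlinear function of the population confusion matrix.

def metric(q, eta=0.5):
    """Value of G = TP - 2*FP^2 for a classifier that predicts class 1
    with probability q on the single atom, where eta = P(Y = 1 | X)."""
    tp = eta * q          # P(Y = 1 and predict 1)
    fp = (1 - eta) * q    # P(Y = 0 and predict 1)
    return tp - 2 * fp ** 2

# The only deterministic rules on one atom: always predict 0, or always predict 1.
deterministic = max(metric(0.0), metric(1.0))

# Stochastic rules: search over prediction probabilities q on a grid in [0, 1].
stochastic = max(metric(q / 100) for q in range(101))

print(deterministic)  # 0.0
print(stochastic)     # 0.125, attained at q = 0.5
```

Here both deterministic rules score $G = 0$, while predicting class 1 with probability $q = 0.5$ attains $G = 0.125$; because a deterministic threshold cannot split the atom, only randomization reaches the interior of the achievable set of confusion matrices.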