Statistical Theory for Imbalanced Binary Classification

Within the vast body of statistical theory developed for binary classification, few meaningful results exist for imbalanced classification, in which data are dominated by samples from one of the two classes. Existing theory faces at least two main challenges. First, meaningful results must consider more complex performance measures than classification accuracy. To address this, we characterize a novel generalization of the Bayes-optimal classifier to any performance metric computed from the confusion matrix, and we use this to show how relative performance guarantees can be obtained in terms of the error of estimating the class probability function under uniform ($\mathcal{L}_\infty$) loss. Second, as we show, optimal classification performance depends on certain properties of class imbalance that have not previously been formalized. Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class Imbalance influences optimal classifier performance and show that it necessitates different classifier behavior than other types of class imbalance. We further illustrate these two contributions in the case of $k$-nearest neighbor classification, for which we develop novel guarantees. Together, these results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.
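As a rough illustration of the first contribution, the sketch below estimates the class probability function with a $k$-nearest neighbor average and then selects a classification threshold that maximizes a metric computed from the confusion matrix (here, the F1 score). This is only a minimal sketch under stated assumptions: the abstract does not specify the form of the generalized Bayes-optimal classifier, so thresholding an estimate of the class probability is an assumption made here for illustration, and all function names (`knn_class_probability`, `confusion_matrix`, `f1_score`, `best_threshold`) and the synthetic data are hypothetical rather than taken from the paper.

```python
import numpy as np


def knn_class_probability(X_train, y_train, X_query, k):
    """Estimate eta(x) = P(Y=1 | X=x) as the mean label of the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def confusion_matrix(y_true, y_pred):
    """Return (TP, FP, FN, TN) counts for binary labels in {0, 1}."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn


def f1_score(tp, fp, fn, tn):
    """F1 score as a function of the confusion matrix (tn is unused by F1)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_threshold(eta_hat, y_true, metric, thresholds):
    """Pick the threshold t for which the rule 1{eta_hat >= t} maximizes the metric."""
    scores = [
        metric(*confusion_matrix(y_true, (eta_hat >= t).astype(int)))
        for t in thresholds
    ]
    i = int(np.argmax(scores))
    return thresholds[i], scores[i]


if __name__ == "__main__":
    # Hypothetical synthetic data: roughly 5% positives, shifted by one unit.
    rng = np.random.default_rng(0)
    n = 2000
    y = (rng.random(n) < 0.05).astype(int)
    X = rng.normal(loc=y[:, None] * 1.0, scale=1.0, size=(n, 2))
    X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

    eta_hat = knn_class_probability(X_tr, y_tr, X_te, k=25)
    t, score = best_threshold(eta_hat, y_te, f1_score, np.linspace(0.0, 1.0, 21))
    print(f"selected threshold: {t:.2f}, F1 at that threshold: {score:.3f}")
```

Under heavy class imbalance, the threshold selected this way may differ substantially from the accuracy-oriented choice of 0.5, which is one informal way to see why metrics beyond accuracy call for different classifier behavior. Selecting the threshold on the same data used to report the score is done here only for brevity; a held-out split would be needed for an honest estimate.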
