The Neyman-Pearson (NP) binary classification paradigm constrains the more severe type of error (e.g., the type I error) below a prespecified level while minimizing the other (e.g., the type II error). This paradigm suits applications such as severe disease diagnosis and fraud detection, among others. A series of NP classifiers has been developed to guarantee type I error control with high probability. However, these existing classifiers involve a sample-splitting step: a mixture of class 0 and class 1 observations is used to construct a scoring function, and some left-out class 0 observations are used to construct a threshold. This splitting allows the classifier construction to rely on independence, but it amounts to an insufficient use of the data for training and a potentially higher type II error. Leveraging a canonical linear discriminant analysis model, we derive a quantitative CLT for a certain functional of quadratic forms involving the inverses of the sample and population covariance matrices, and based on this result, we develop, for the first time, NP classifiers that do not split the training sample. Numerical experiments confirm the advantages of our new non-splitting parametric strategy.
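To make the split-based paradigm that this work improves upon concrete, the following is a minimal sketch of an NP classifier built with sample splitting: a scoring function trained on mixed class 0 and class 1 data, and a threshold chosen from left-out class 0 scores via an order-statistic rule so that the type I error stays below alpha with probability at least 1 - delta. The logistic-regression score, the half-half split, and the function name are illustrative assumptions, not the construction proposed in this paper.

```python
# Sketch of a split-based NP classifier (the existing paradigm the abstract
# contrasts against, not the paper's non-splitting method). Scoring function
# and split proportions are illustrative choices.
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression

def np_classifier_with_splitting(X0, X1, alpha=0.05, delta=0.05, seed=0):
    """Train a score on mixed data, then pick a threshold from left-out
    class 0 scores so that P(type I error > alpha) <= delta."""
    rng = np.random.default_rng(seed)
    # Split class 0: half for training the score, half left out for the threshold.
    idx = rng.permutation(len(X0))
    X0_train, X0_hold = X0[idx[: len(X0) // 2]], X0[idx[len(X0) // 2 :]]

    # Scoring function fit on a mixture of class 0 and class 1 observations.
    X_train = np.vstack([X0_train, X1])
    y_train = np.concatenate([np.zeros(len(X0_train)), np.ones(len(X1))])
    score = LogisticRegression().fit(X_train, y_train)

    # Left-out class 0 scores, sorted ascending; independence of these scores
    # from the fitted score function is what the splitting buys.
    t = np.sort(score.decision_function(X0_hold))
    n = len(t)

    # Choose the smallest order statistic t[k-1] whose violation probability
    # P(type I error > alpha) = P(Binomial(n, 1 - alpha) >= k) is at most delta.
    ks = np.arange(1, n + 1)
    violation = binom.sf(ks - 1, n, 1 - alpha)
    feasible = np.where(violation <= delta)[0]
    if len(feasible) == 0:
        raise ValueError("Too few left-out class 0 points for this (alpha, delta).")
    threshold = t[feasible[0]]

    # Classify as class 1 when the score exceeds the threshold.
    return lambda X: (score.decision_function(X) > threshold).astype(int)
```

Because the threshold comes only from the held-out class 0 scores, the high-probability type I error guarantee holds regardless of how the scoring function was trained; the cost, as noted above, is that part of the class 0 sample is withheld from training, which can inflate the type II error.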