In binary classification, imbalance refers to situations in which one class is heavily under-represented. This arises either from the data collection process or because one class is genuinely rare in the population. Imbalanced classification occurs frequently in fields such as biology, medicine, engineering, and the social sciences. In this paper, for the first time, we theoretically study the impact of imbalanced class sizes on linear discriminant analysis (LDA) in high dimensions. We show that, due to data scarcity in one class, referred to as the minority class, and the high dimensionality of the feature space, LDA ignores the minority class, yielding a maximum misclassification rate. We then propose a new construction of hard-thresholding rules, based on a data-splitting technique, that reduces the large difference between the misclassification rates of the two classes. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of LDA in imbalanced cases. We evaluate the finite-sample performance of the different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or achieves comparable performance using a much smaller subset of selected features, while being computationally more efficient.
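The claim that LDA ignores the minority class in high dimensions can be illustrated with a small simulation. The following is a minimal sketch under assumptions that are not taken from the paper: Gaussian classes with identity covariance, a simplified LDA rule that uses only the two sample means (no covariance estimate and no prior term), and hypothetical values for the dimension, the class sizes, and the mean shift. It does not implement the proposed hard-thresholding method.

```python
import numpy as np


def simulate_lda_imbalance(p=1000, n_major=200, n_minor=10, n_test=2000, seed=0):
    """Estimate per-class test error of a mean-based LDA rule under imbalance.

    All settings here are illustrative assumptions, not the paper's design.
    """
    rng = np.random.default_rng(seed)

    # Assumed setup: identity covariance, classes differ in the first 10 features.
    mu_major = np.zeros(p)
    mu_minor = np.zeros(p)
    mu_minor[:10] = 1.0

    # Imbalanced training samples.
    X_major = rng.normal(size=(n_major, p)) + mu_major
    X_minor = rng.normal(size=(n_minor, p)) + mu_minor

    # Simplified LDA with known identity covariance: project onto the
    # difference of sample means and compare with the midpoint.
    m_major = X_major.mean(axis=0)
    m_minor = X_minor.mean(axis=0)
    direction = m_major - m_minor
    midpoint = (m_major + m_minor) / 2

    def predict_major(X):
        # True means "assigned to the majority class".
        return (X - midpoint) @ direction > 0

    # Balanced test sets, so each class gets its own error estimate.
    T_major = rng.normal(size=(n_test, p)) + mu_major
    T_minor = rng.normal(size=(n_test, p)) + mu_minor

    err_major = float(np.mean(~predict_major(T_major)))
    err_minor = float(np.mean(predict_major(T_minor)))
    return err_major, err_minor


err_major, err_minor = simulate_lda_imbalance()
print(f"majority-class error: {err_major:.3f}")
print(f"minority-class error: {err_minor:.3f}")
```

In this setup the minority sample mean is much noisier than the majority sample mean (its squared estimation error is of order p/n_minor rather than p/n_major), which shifts the decision boundary so that almost every test point is assigned to the majority class. The prior term log(n_major/n_minor) is deliberately left out; including it would push the rule further toward the majority class.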