When facing data with imbalanced classes or groups, practitioners follow an intriguing strategy to achieve the best results: they throw away examples until the classes or groups are balanced in size, and then perform empirical risk minimization on the reduced training set. This practice opposes common wisdom in learning theory, according to which the expected error decreases as the dataset grows. In this work, we leverage extreme value theory to address this apparent contradiction. Our results show that the tails of the data distribution play an important role in determining the worst-group accuracy of linear classifiers. When learning on data with heavy tails, throwing away data restores the geometric symmetry of the resulting classifier, and therefore improves its worst-group generalization.
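As a minimal illustration of the strategy described above, the sketch below subsamples every group down to the size of the smallest one and then performs empirical risk minimization with a linear classifier, reporting worst-group accuracy. The helper names (`subsample_to_balance`, `worst_group_accuracy`) and the synthetic heavy-tailed data are hypothetical, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def subsample_to_balance(X, y, groups, rng):
    """Throw away examples from larger groups until every group
    matches the size of the smallest one (hypothetical helper)."""
    n_min = min(np.sum(groups == g) for g in np.unique(groups))
    keep = np.concatenate([
        rng.choice(np.flatnonzero(groups == g), size=n_min, replace=False)
        for g in np.unique(groups)
    ])
    return X[keep], y[keep]

def worst_group_accuracy(clf, X, y, groups):
    """Minimum per-group accuracy of a fitted classifier."""
    return min(clf.score(X[groups == g], y[groups == g])
               for g in np.unique(groups))

rng = np.random.default_rng(0)

# Synthetic imbalanced data: two heavy-tailed (Student-t) groups,
# one 10x larger than the other; here groups coincide with classes.
n_big, n_small = 5000, 500
X = np.vstack([rng.standard_t(df=3, size=(n_big, 2)) + [+2.0, 0.0],
               rng.standard_t(df=3, size=(n_small, 2)) + [-2.0, 0.0]])
y = np.concatenate([np.ones(n_big), np.zeros(n_small)])
groups = np.concatenate([np.zeros(n_big, int), np.ones(n_small, int)])

# ERM on the full data vs. ERM on the balanced subset
# (evaluated on the same data, for simplicity of illustration).
erm = LogisticRegression().fit(X, y)
X_bal, y_bal = subsample_to_balance(X, y, groups, rng)
balanced = LogisticRegression().fit(X_bal, y_bal)

print("worst-group acc, full data:", worst_group_accuracy(erm, X, y, groups))
print("worst-group acc, balanced :", worst_group_accuracy(balanced, X, y, groups))
```

On heavy-tailed data like this, the full-data fit tends to shift the decision boundary toward the minority group, so the balanced fit typically attains higher worst-group accuracy, consistent with the symmetry argument above.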