Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees due to their interpretability and flexibility. When data are limited in one or more classes, the estimated decision boundaries are often irregularly shaped owing to the small sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, yielding a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation, and we prove estimation consistency for SVR-Tree as well as a rate of convergence for an idealized empirical risk minimizer of SVR-Tree. Through applications to real data, SVR-Tree is compared with multiple algorithms designed to handle class imbalance.
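To make the penalty concrete, the following is a minimal sketch (not the paper's implementation) of computing a surface-to-volume ratio when the decision set is a union of disjoint axis-aligned boxes, as arises from the leaves of a classification tree. The function names `box_surface_volume` and `svr_penalty` are hypothetical, and the sketch deliberately ignores shared faces between adjacent boxes:

```python
import numpy as np

def box_surface_volume(lower, upper):
    """Surface area and volume of an axis-aligned box in d dimensions."""
    sides = np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)
    vol = float(np.prod(sides))
    # Surface area: for each dimension, two faces whose area is the
    # product of the remaining side lengths, i.e. vol / side length.
    surf = 2.0 * sum(vol / s for s in sides)
    return surf, vol

def svr_penalty(boxes):
    """Surface-to-Volume Ratio of a union of disjoint axis-aligned boxes.

    Simplification (assumption, not from the paper): faces shared by
    adjacent boxes are counted rather than cancelled.
    """
    total_surf = total_vol = 0.0
    for lo, hi in boxes:
        s, v = box_surface_volume(lo, hi)
        total_surf += s
        total_vol += v
    return total_surf / total_vol if total_vol > 0 else float("inf")
```

Intuitively, an irregular, fragmented decision set has a large boundary relative to its volume, so adding such a penalty to the empirical risk favors smoother, more regular decision sets. For example, a unit cube has SVR 6, while an elongated 2 x 1 rectangle has SVR 3 (perimeter 6 over area 2) in two dimensions.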