Nonlinear metrics, such as the F1-score, Matthews correlation coefficient, and Fowlkes-Mallows index, are often used to evaluate the performance of machine learning models, in particular when facing imbalanced datasets that contain more samples of one class than the other. Recent optimal decision tree algorithms have shown remarkable progress in producing trees that are optimal with respect to linear criteria, such as accuracy, but unfortunately nonlinear metrics remain a challenge. To address this gap, we propose a novel algorithm based on bi-objective optimisation, which treats misclassifications of each binary class as a separate objective. We show that, for a large class of metrics, the optimal tree lies on the Pareto frontier. Consequently, we obtain the optimal tree by using our method to generate the set of all nondominated trees. To the best of our knowledge, this is the first method to compute provably optimal decision trees for nonlinear metrics. Our approach leads to a trade-off when compared to optimising linear metrics: the resulting trees may be more desirable according to the given nonlinear metric at the expense of higher runtimes. Nevertheless, the experiments illustrate that runtimes are reasonable for the majority of the tested datasets.
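The key observation, that the metric-optimal tree lies on the Pareto frontier of the two misclassification objectives, can be illustrated with a minimal sketch. The code below is a hypothetical example (the function names and the toy frontier are our own, not from the paper): given the nondominated (false positives, false negatives) points achieved by candidate trees, the F1-optimal tree is found simply by scoring each frontier point.

```python
def f1_score(fp: int, fn: int, pos: int) -> float:
    """F1 computed from false positives, false negatives,
    and the total number of positive samples."""
    tp = pos - fn  # true positives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / pos
    return 2 * precision * recall / (precision + recall)

def best_on_frontier(frontier, pos):
    """Pick the (fp, fn) pair maximising F1.

    frontier: list of nondominated (fp, fn) pairs, one per tree;
    pos: number of positive samples in the dataset.
    Since F1 is monotone in both error counts, no dominated point
    can beat every frontier point, so scanning the frontier suffices.
    """
    return max(frontier, key=lambda p: f1_score(p[0], p[1], pos))

# Toy example: 10 positive samples, three nondominated trees.
frontier = [(0, 4), (2, 1), (6, 0)]
print(best_on_frontier(frontier, pos=10))  # → (2, 1)
```

Here the middle tree, which balances the two error types, achieves the highest F1 even though the other two trees are each optimal for one objective in isolation; this is exactly why a bi-objective enumeration is needed rather than a single weighted (linear) objective.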