The ultimate performance of machine learning algorithms for classification tasks is usually measured in terms of the empirical error probability (or accuracy) on a test dataset, whereas these algorithms are optimized by minimizing a typically different, more convenient, loss function on a training set. For classification tasks, this loss function is often the negative log-loss, which leads to the well-known cross-entropy risk and is typically better behaved, from a numerical perspective, than the error probability. Conventional studies of the generalization error do not usually account for this underlying mismatch between the losses used at training and testing. In this work, we introduce an analysis of the generalization gap based on a point-wise PAC approach that accounts for the mismatch between testing with the accuracy metric and training with the negative log-loss. We call this analysis PACMAN. Building on the fact that this mismatch can be written as a likelihood ratio, concentration inequalities can be used to provide insights into the generalization problem in the form of point-wise PAC bounds that depend on meaningful information-theoretic quantities. An analysis of the obtained bounds and a comparison with results available in the literature are also provided.
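To make the train/test mismatch concrete, the following is a minimal sketch using standard notation that is assumed here rather than taken from the paper: given a probabilistic classifier $Q_{\hat{Y}|X}$ and samples $\{(x_i, y_i)\}_{i=1}^{n}$, the two objectives can be written as
$$
\widehat{\mathcal{L}}_{0\text{-}1} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left\{\hat{y}(x_i) \neq y_i\right\},
\qquad
\widehat{\mathcal{L}}_{\mathrm{CE}} \;=\; -\frac{1}{n}\sum_{i=1}^{n} \log Q_{\hat{Y}|X}(y_i \mid x_i),
$$
where $\hat{y}(x) = \arg\max_{y} Q_{\hat{Y}|X}(y \mid x)$. Training minimizes the cross-entropy risk $\widehat{\mathcal{L}}_{\mathrm{CE}}$, while reported performance is the error probability $\widehat{\mathcal{L}}_{0\text{-}1}$ (equivalently, one minus the accuracy); the gap between these two quantities is the mismatch the analysis addresses.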