Determining the optimal sample complexity of PAC learning in the realizable setting was a central open problem in learning theory for decades. Finally, the seminal work by Hanneke (2016) gave an algorithm with a provably optimal sample complexity. His algorithm is based on a careful and structured sub-sampling of the training data, returning a majority vote among hypotheses trained on each of the sub-samples. While an exciting theoretical result, it has had little practical impact, in part due to its inefficiency: it constructs a polynomial number of sub-samples of the training data, each of linear size. In this work, we prove the surprising result that the practical and classic heuristic bagging (a.k.a. bootstrap aggregation), due to Breiman (1996), is in fact also an optimal PAC learner. Bagging pre-dates Hanneke's algorithm by twenty years and is taught in most undergraduate machine learning courses. Moreover, we show that it only requires a logarithmic number of sub-samples to reach optimality.
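The bagging procedure described above can be sketched in a few lines: draw bootstrap sub-samples (sampled with replacement from the training set), train a base learner on each, and return the majority vote. The sketch below is a minimal illustration only; the threshold-stump base learner, the dataset, and the `18 * log(n)` vote count are illustrative assumptions, not the paper's exact base learner or constants.

```python
import numpy as np

def train_stump(X, y):
    """Base learner: best 1-D threshold classifier (stands in for any ERM)."""
    best_t, best_err = X.min() - 1.0, np.inf
    for t in np.concatenate(([X.min() - 1.0], X)):
        err = np.mean((X > t) != y)  # error of predicting 1 iff X > t
        if err < best_err:
            best_err, best_t = err, t
    return best_t

def bagging_predict(X_train, y_train, X_test, n_votes, rng):
    """Bagging: majority vote over stumps trained on bootstrap sub-samples."""
    n = len(X_train)
    votes = np.zeros(len(X_test))
    for _ in range(n_votes):
        idx = rng.integers(0, n, size=n)          # bootstrap: sample with replacement
        t = train_stump(X_train[idx], y_train[idx])
        votes += (X_test > t).astype(int)
    return (votes >= n_votes / 2).astype(int)     # majority vote

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
y = (X > 0.5).astype(int)                          # realizable: a threshold labels the data
n_votes = int(np.ceil(18 * np.log(len(X))))        # O(log n) sub-samples; constant is illustrative
preds = bagging_predict(X, y, np.array([0.1, 0.9]), n_votes, rng)
```

Here `preds` classifies the test points 0.1 and 0.9 as 0 and 1 respectively; the key point is that the number of sub-samples grows only logarithmically in the training set size, in contrast to the polynomially many linear-size sub-samples of Hanneke's construction.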