Cross-validation (CV) is one of the main tools for performance estimation and parameter tuning in machine learning. The general recipe for computing the CV estimate is to run a learning algorithm separately for each CV fold, a computationally expensive process. In this paper, we propose a new approach to reduce the computational burden of CV-based performance estimation. Unlike previous attempts, which are specific to a particular learning model or problem domain, we propose a general method applicable to a large class of incremental learning algorithms, which are particularly well suited to big data problems. In particular, our method applies to a wide range of supervised and unsupervised learning tasks with different performance criteria, as long as the base learning algorithm is incremental. We show that the running time of the algorithm scales logarithmically, rather than linearly, in the number of CV folds. Furthermore, the algorithm has favorable properties for parallel and distributed implementation. Experiments with state-of-the-art incremental learning algorithms confirm the practicality of the proposed method.
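For context, the following is a minimal sketch of the standard per-fold CV recipe whose cost the paper aims to reduce: the incremental learner is retrained from scratch once per fold, so the total work grows linearly with the number of folds. The learner, dataset, fold count, and batching below are illustrative assumptions (using scikit-learn's SGDClassifier with partial_fit), not the method proposed in the paper.

```python
# Standard k-fold CV with an incremental base learner (illustrative sketch,
# not the paper's method). Cost is linear in k: one full training run per fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classes = np.unique(y)

k = 10
scores = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = SGDClassifier(loss="log_loss", random_state=0)
    # The base learner is incremental: the training fold is fed in batches
    # via partial_fit, but the standard recipe still retrains from scratch
    # for every fold -- k full training passes in total.
    for batch in np.array_split(train_idx, 10):
        model.partial_fit(X[batch], y[batch], classes=classes)
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"{k}-fold CV accuracy: {np.mean(scores):.3f}")
```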