Many problems require optimizing empirical risk functions over large data sets. Gradient descent methods that compute the full gradient at every descent step do not scale to such data sets. Various flavours of Stochastic Gradient Descent (SGD) replace the expensive summation that computes the full gradient with a small sum over a randomly selected subsample of the data set; this approximation, in turn, suffers from high variance. We present a different approach, inspired by classical results of Tchakaloff and Carath\'eodory on measure reduction. These results allow an empirical measure to be replaced by another, carefully constructed probability measure with much smaller support that nevertheless preserves certain statistics, such as the expected gradient. To turn this into scalable algorithms, we first adaptively select the descent steps at which the measure reduction is carried out; second, we combine this with Block Coordinate Descent so that the measure reduction can be performed very cheaply. This makes the resulting methods scalable to high-dimensional spaces. Finally, we provide an experimental validation and comparison.
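To make the measure-reduction idea concrete, the following is a minimal sketch, not the authors' algorithm: a naive Carath\'eodory-type reduction in Python/NumPy that replaces a discrete probability measure on N points in R^d by one supported on at most d+1 of those points while preserving the mean. In the SGD setting the "points" would be per-sample gradients (or gradient blocks, in the Block Coordinate Descent variant), so the expected gradient is preserved. The function name caratheodory_reduction and the SVD-based null-space search are illustrative choices only.

```python
import numpy as np

def caratheodory_reduction(points, weights, tol=1e-12):
    """Reduce a discrete probability measure on N points in R^d to one
    supported on at most d + 1 of the points with the same weighted mean.

    points:  (N, d) array of atoms (e.g. per-sample gradients).
    weights: (N,) nonnegative weights summing to 1.
    Returns (indices, reduced_weights) describing the reduced support.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float).copy()
    d = points.shape[1]
    idx = np.flatnonzero(weights > tol)

    while idx.size > d + 1:
        # Affine dependency on the current support: find v != 0 with
        # sum_i v_i x_i = 0 and sum_i v_i = 0, i.e. a null-space vector
        # of the (d+1) x n matrix whose rows are the coordinates plus ones.
        A = np.vstack([points[idx].T, np.ones(idx.size)])
        _, _, vh = np.linalg.svd(A)      # full_matrices=True by default
        v = vh[-1]                       # right singular vector, A @ v ~ 0

        # Largest step that keeps all weights nonnegative; it drives at
        # least one weight (the minimizer) exactly to zero.
        pos = v > tol                    # nonempty since sum_i v_i = 0, v != 0
        ratios = np.full(idx.size, np.inf)
        ratios[pos] = weights[idx][pos] / v[pos]
        j = int(np.argmin(ratios))
        alpha = ratios[j]

        # Update preserves total mass (sum v_i = 0) and the weighted mean
        # (sum v_i x_i = 0); clipping only removes floating-point negatives.
        weights[idx] = np.maximum(weights[idx] - alpha * v, 0.0)
        weights[idx[j]] = 0.0
        idx = idx[weights[idx] > 0]

    return idx, weights[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))       # stand-in for per-sample gradients
    w = np.full(1000, 1.0 / 1000)        # uniform empirical measure
    idx, w_red = caratheodory_reduction(X, w)
    assert idx.size <= 6                              # at most d + 1 atoms
    assert np.allclose(w_red @ X[idx], w @ X)         # same expected value
```

Each pass through the loop removes at least one atom, so the reduction terminates after at most N - (d + 1) eliminations; the efficiency gains described in the abstract come from carrying out such reductions only at adaptively chosen descent steps and block-wise, which this sketch does not attempt to reproduce.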