A wide range of optimization problems arising in machine learning can be solved by gradient descent algorithms, and a central question in this area is how to efficiently compress a large-scale dataset so as to reduce the computational complexity. The {\em coreset} is a popular data compression technique that has been extensively studied. However, most existing coreset methods are problem-dependent and cannot serve as a general tool for a broader range of applications. A key obstacle is that they often rely on pseudo-dimension and total sensitivity bounds, which can be very high or hard to obtain. In this paper, based on the ``locality'' property of gradient descent algorithms, we propose a new framework, termed ``sequential coreset'', that effectively avoids these obstacles. Moreover, our method is particularly suitable for sparse optimization, where the coreset size can be further reduced to depend only poly-logarithmically on the dimension. In practice, our experimental results suggest that the method saves a substantial amount of running time compared with the baseline algorithms.
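To make the ``locality'' idea concrete, below is a minimal Python sketch under assumptions of our own: because gradient descent moves the solution only a short distance per step, a small weighted subsample built around the current iterate can stand in for the full dataset during a batch of iterations, after which it is rebuilt. The loss (least squares), the importance-sampling construction, and all function names (\texttt{build\_local\_coreset}, \texttt{sequential\_coreset\_gd}, \texttt{grad\_f}) are illustrative placeholders, not the paper's actual construction.

\begin{verbatim}
import numpy as np

def grad_f(w, X, y, weights):
    # Weighted least-squares gradient as a stand-in loss; any smooth
    # loss whose gradient decomposes over the data would work.
    # Inverse-probability weights keep this an unbiased estimate of
    # the mean gradient over the full dataset.
    residual = X @ w - y
    return X.T @ (weights * residual) / len(weights)

def build_local_coreset(X, y, w, m, rng):
    # Toy local coreset: sample m points with probability proportional
    # to their loss contribution at the current solution w, then attach
    # inverse-probability weights so gradient estimates stay unbiased.
    contrib = (X @ w - y) ** 2 + 1e-12
    p = contrib / contrib.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (len(X) * p[idx])
    return X[idx], y[idx], weights

def sequential_coreset_gd(X, y, m=64, stages=20, inner_steps=25,
                          lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(stages):            # outer loop: refresh the coreset
        Xc, yc, wts = build_local_coreset(X, y, w, m, rng)
        for _ in range(inner_steps):   # inner loop: GD on the coreset only
            w -= lr * grad_f(w, Xc, yc, wts)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10000, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.01 * rng.normal(size=10000)
    w_hat = sequential_coreset_gd(X, y)
    print("recovery error:", np.linalg.norm(w_hat - w_true))
\end{verbatim}

The outer loop is where the sequential structure enters: each coreset only needs to be accurate in a neighborhood of the current solution, which is what lets a locality-based scheme sidestep global pseudo-dimension and total sensitivity bounds.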