Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., when a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch'' theorems for data pruning and present calibration protocols that use randomization to enhance the performance of existing pruning algorithms in this high compression regime.
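To make the setting concrete, the sketch below contrasts plain score-based pruning (keep the top-scored fraction of examples) with one simple randomized variant that fills part of the kept budget with uniformly sampled discarded examples. This is a minimal illustration under assumed conventions, not the paper's calibration protocols: the score function, the `rand_frac` split, and the helper names are all hypothetical.

```python
import numpy as np

def score_based_prune(scores, keep_frac):
    """Keep the top `keep_frac` fraction of examples by pruning score.
    `scores` is a hypothetical per-example score array (higher = kept first)."""
    n_keep = int(len(scores) * keep_frac)
    # Indices of the highest-scoring examples, in descending score order.
    return np.argsort(scores)[::-1][:n_keep]

def randomized_prune(scores, keep_frac, rand_frac, seed=None):
    """Illustrative randomized variant (an assumption, not the paper's
    protocol): a `rand_frac` share of the kept budget is drawn uniformly
    at random from examples the score-based rule would have discarded."""
    rng = np.random.default_rng(seed)
    n_keep = int(len(scores) * keep_frac)
    n_rand = int(n_keep * rand_frac)
    ranked = np.argsort(scores)[::-1]
    kept_by_score = ranked[: n_keep - n_rand]   # top-scored examples
    discard_pool = ranked[n_keep - n_rand :]    # everything else
    kept_random = rng.choice(discard_pool, size=n_rand, replace=False)
    return np.concatenate([kept_by_score, kept_random])

# Example: keep 30% of 10,000 examples, with half the budget randomized.
scores = np.random.default_rng(0).random(10_000)
kept = randomized_prune(scores, keep_frac=0.3, rand_frac=0.5, seed=0)
print(len(kept))  # 3000 retained indices
```

In the high compression regime the randomized share injects examples a purely score-ranked rule would never select, which is the intuition behind why randomization can help there.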