Training neural networks on a large dataset requires substantial computational cost. Dataset reduction selects or synthesizes data instances based on the large dataset, while minimizing the degradation in generalization performance relative to the full dataset. Existing methods utilize the neural network during the dataset reduction procedure, so the model parameters become an important factor in preserving performance after reduction. Motivated by this dependence on the model parameters, this paper introduces a new reduction objective, coined LCMat, which Matches the Loss Curvatures of the original dataset and the reduced dataset over the model parameter space, rather than at a single parameter point. This new objective induces better adaptation of the reduced dataset to the perturbed parameter region than exact point matching. In particular, we identify the worst case of the loss curvature gap over the local parameter region, and we derive an implementable upper bound of this worst case with theoretical analyses. Our experiments on both coreset selection and dataset condensation benchmarks illustrate that LCMat achieves better generalization performance than existing baselines.
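To make the idea of matching loss curvatures over a parameter region more concrete, the following is a minimal, illustrative sketch in PyTorch. It is not the LCMat objective from the paper: the function name, the random-direction Hessian-vector probe, and the choice of norms are assumptions made purely for illustration of what comparing the curvature of the full-data loss and the reduced-data loss around the current parameters could look like.

```python
import torch

def curvature_matching_gap(model, loss_fn, full_batch, reduced_batch, num_dirs=1):
    """Hypothetical proxy for a loss-curvature matching objective.

    Measures how differently the full-data loss and the reduced-data loss
    curve around the current parameters, using Hessian-vector products
    along random directions as a curvature probe. Illustrative only;
    not the upper bound derived in the LCMat paper.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(batch):
        # Gradient of the loss on one batch, flattened, with the graph kept
        # so second-order derivatives can be taken afterwards.
        x, y = batch
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        return torch.cat([g.reshape(-1) for g in grads])

    g_full = flat_grad(full_batch)      # gradient on the original data
    g_red = flat_grad(reduced_batch)    # gradient on the reduced data

    gap = (g_full - g_red).norm()       # first-order (gradient) mismatch
    for _ in range(num_dirs):
        v = torch.randn_like(g_full)    # random probe direction in parameter space
        hv_full = torch.autograd.grad(g_full @ v, params, retain_graph=True)
        hv_red = torch.autograd.grad(g_red @ v, params, retain_graph=True)
        hv_full = torch.cat([h.reshape(-1) for h in hv_full])
        hv_red = torch.cat([h.reshape(-1) for h in hv_red])
        gap = gap + (hv_full - hv_red).norm() / num_dirs  # curvature mismatch along v
    return gap
```

Under these assumptions, a reduced dataset whose gradients and Hessian-vector products stay close to those of the full dataset keeps the loss landscape similar not only at the current parameters but also under small parameter perturbations, which is the intuition the abstract describes.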