Recent success in deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, in standard vision datasets, simple scores averaged over several weight initializations can be used to identify important examples very early in training. We propose two such scores -- the Gradient Normed (GraNd) and the Error L2-Norm (EL2N) scores -- and demonstrate their efficacy on a range of architectures and datasets by pruning significant fractions of training data without sacrificing test accuracy. In fact, using EL2N scores calculated a few epochs into training, we can prune half of the CIFAR10 training set while slightly improving test accuracy. Furthermore, for a given dataset, EL2N scores from one architecture or hyperparameter configuration generalize to other configurations. Compared to recent work that prunes data by discarding examples that are rarely forgotten over the course of training, our scores use only local information early in training. We also use our scores to detect noisy examples and study training dynamics through the lens of important examples -- we investigate how the data distribution shapes the loss surface and identify subspaces of the model's data representation that are relatively stable over training.
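To make the scores concrete: EL2N is the L2 norm of the error vector (the softmax output minus the one-hot label), and GraNd is the expected norm of the loss gradient with respect to the weights, each averaged over several independently initialized models early in training. Below is a minimal PyTorch-style sketch of EL2N scoring, assuming `models` is a list of networks each trained for a few epochs from a different random initialization and `loader` iterates the training set without shuffling; the function name and arguments are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(models, loader, num_classes, device="cpu"):
    """EL2N score per example: the L2 norm of the error vector
    softmax(f(x)) - onehot(y), averaged over several models trained
    from different random initializations (hypothetical helper)."""
    scores = []
    for model in models:
        model.eval().to(device)
        per_model = []
        for x, y in loader:  # loader must not shuffle, so examples align across models
            p = F.softmax(model(x.to(device)), dim=1)
            onehot = F.one_hot(y.to(device), num_classes).float()
            per_model.append((p - onehot).norm(dim=1))  # per-example L2 error norm
        scores.append(torch.cat(per_model))
    return torch.stack(scores).mean(dim=0)  # shape: [num_examples]

# Example use for pruning: keep the highest-scoring half of the
# training set and discard the low-scoring (easy) examples.
# el2n = el2n_scores(models, loader, num_classes=10)
# keep_indices = el2n.argsort(descending=True)[: len(el2n) // 2]
```

One design note under the same assumptions: the scores are computed with a fixed, unshuffled data order so that per-example scores can be averaged across models, and pruning then amounts to sorting by score and retaining the top fraction.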