The recent success of deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, on standard vision benchmarks, the initial loss gradient norm of individual training examples, averaged over several weight initializations, can be used to identify a smaller set of training data that is important for generalization. Furthermore, after only a few epochs of training, the information in gradient norms is reflected in the normed error (the L2 distance between the predicted probabilities and the one-hot labels), which can be used to prune a significant fraction of the dataset without sacrificing test accuracy. Based on this, we propose data pruning methods that use only local information early in training, and connect them to recent work that prunes data by discarding examples that are rarely forgotten over the course of training. Our methods also shed light on how the underlying data distribution shapes the training dynamics: they rank examples based on their importance for generalization, detect noisy examples, and identify subspaces of the model's data representation that are relatively stable over training.
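As a concrete illustration of the normed-error score described above, the following minimal NumPy sketch computes per-example scores as the L2 distance between predicted probabilities and one-hot labels, then keeps the highest-scoring examples. The function names (`normed_error_scores`, `prune_by_score`) and the random stand-in for a partially trained model's outputs are our own illustrative choices, not the paper's implementation; in practice the scores would come from a real model after a few epochs of training, possibly averaged over several runs.

```python
import numpy as np

def normed_error_scores(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-example normed error: L2 distance between the predicted
    class probabilities and the one-hot label vector.

    probs:  (N, C) array of softmax outputs.
    labels: (N,) array of integer class labels.
    """
    one_hot = np.eye(probs.shape[1])[labels]          # (N, C) one-hot labels
    return np.linalg.norm(probs - one_hot, axis=1)    # (N,) scores

def prune_by_score(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return indices of the highest-scoring examples to keep;
    low-scoring (already well-fit) examples are discarded."""
    n_keep = int(len(scores) * keep_fraction)
    return np.argsort(scores)[::-1][:n_keep]

# Toy usage: random "predictions" stand in for a model's outputs
# after a few epochs of training (hypothetical, for illustration only).
rng = np.random.default_rng(0)
N, C = 1000, 10
logits = rng.normal(size=(N, C))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, C, size=N)

scores = normed_error_scores(probs, labels)
kept = prune_by_score(scores, keep_fraction=0.5)
print(f"kept {len(kept)} of {N} examples")
```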