Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model, either by analyzing the behavior of the model during training or by measuring the performance gap of the model when the instance is removed from the dataset. Such approaches reveal the characteristics and importance of individual instances, which may provide useful information for diagnosing and improving deep learning. However, most existing works on data valuation require actual training of a model, which often demands a high computational cost. In this paper, we provide a training-free data valuation score, called the complexity-gap score, a data-centric score that quantifies the influence of individual instances on the generalization of two-layer overparameterized neural networks. The proposed score can quantify the irregularity of instances and measure how much each data instance contributes to the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and also provide applications of the score to analyzing datasets and diagnosing training dynamics.
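To make the idea concrete, the following is a minimal sketch of a training-free, leave-one-out "complexity gap" for a two-layer overparameterized ReLU network, assuming the score builds on the NTK-based data-dependent complexity measure \(\sqrt{2\,\mathbf{y}^\top (H^{\infty})^{-1} \mathbf{y}/n}\) of Arora et al. (2019). The leave-one-out definition, the ridge term `reg`, and all function names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ntk_gram(X):
    """Infinite-width NTK Gram matrix H^inf of a two-layer ReLU network.

    Assumes rows of X are unit-normalized; entry (i, j) is
    x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2*pi).
    """
    G = np.clip(X @ X.T, -1.0, 1.0)            # cosine similarities
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def complexity(X, y, reg=1e-6):
    """Data-dependent complexity measure sqrt(2 * y^T (H^inf)^-1 y / n)."""
    n = len(y)
    H = ntk_gram(X) + reg * np.eye(n)          # small ridge for numerical stability
    return np.sqrt(2.0 * y @ np.linalg.solve(H, y) / n)

def complexity_gap_scores(X, y):
    """Leave-one-out gap: change in complexity when instance i is removed.

    Large positive gaps flag instances that inflate the complexity measure,
    i.e. candidates for irregular or mislabeled data. (Illustrative definition;
    the paper's score may differ.)
    """
    base = complexity(X, y)
    gaps = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        gaps[i] = base - complexity(X[mask], y[mask])
    return gaps

# Toy usage: unit-normalized inputs, +/-1 labels, a few injected label flips.
X = np.random.randn(200, 32)
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])
y[:5] *= -1                                    # mislabel the first five instances
scores = complexity_gap_scores(X, y)
print(np.argsort(scores)[::-1][:10])           # flipped labels tend to rank highest
```

Note that no network is trained here: everything is computed from the closed-form NTK Gram matrix of the data, which is what makes such a score "training-free" in contrast to retraining-based data valuation.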