Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model. Such attempts reveal the characteristics and importance of individual instances, which may provide useful information for diagnosing and improving deep learning. However, most existing works on data valuation require actual training of a model, which often demands high computational cost. In this paper, we provide a training-free data valuation score, called the complexity-gap score, which is a data-centric score that quantifies the influence of individual instances on the generalization of two-layer overparameterized neural networks. The proposed score can quantify the irregularity of instances and measure how much each data instance contributes to the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding "irregular or mislabeled" data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics. Our code is publicly available at https://github.com/JJchy/CG_score.