Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. As a result, applying them to large datasets has been considered infeasible. To address this issue, we propose Data-OOB, a new data valuation method for bagging models that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor to evaluate $10^6$ samples with an input dimension of 100. Furthermore, Data-OOB has a solid theoretical interpretation: when two different points are compared, it identifies the same important data point as the infinitesimal jackknife influence function. We conduct comprehensive experiments on 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods at identifying mislabeled data and finding sets of helpful (or harmful) data points, highlighting the potential of applying data values in real-world applications.
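The out-of-bag idea above can be illustrated with a minimal sketch: train many weak learners on bootstrap samples, then score each training point by the average correctness of the learners that did *not* see it during training. This is only an illustrative approximation of the described approach; the estimator choice, tree depth, number of learners, and scoring rule here are assumptions, not the paper's exact configuration.

```python
# Sketch of an out-of-bag (OOB) data value: each point's value is the
# average accuracy of the weak learners for which it was out-of-bag.
# Hyperparameters (n_estimators, max_depth) are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def oob_data_values(X, y, n_estimators=200, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    correct = np.zeros(n)  # sum of 1[prediction == label] over OOB learners
    counts = np.zeros(n)   # number of learners for which the point is OOB
    for b in range(n_estimators):
        boot = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)  # indices not drawn -> out-of-bag
        tree = DecisionTreeClassifier(max_depth=3, random_state=b)
        tree.fit(X[boot], y[boot])              # train weak learner on the bag
        correct[oob] += (tree.predict(X[oob]) == y[oob])
        counts[oob] += 1
    # Per-point OOB accuracy; guard against points that were never OOB.
    return correct / np.maximum(counts, 1)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
values = oob_data_values(X, y)
# Points with low OOB values are candidates for mislabeled or harmful data.
```

Because every weak learner is trained only once and then reused to score all of its out-of-bag points, the cost is that of fitting a single bagging ensemble, which is what makes this style of valuation scale to large datasets.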