Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, advances this understanding by assessing the contribution of each datum to model performance. However, the resource-intensive and time-consuming nature of multiple model retraining poses challenges for applying Data Shapley to large datasets. To address this, we propose the CHG (compound of Hardness and Gradient) utility function, which approximates the utility of each data subset on model performance in every training epoch. By deriving the closed-form Shapley value for each data point using the CHG utility function, we reduce the computational complexity to that of a single model retraining, achieving a quadratic improvement over existing marginal contribution-based methods. We further leverage CHG Shapley for real-time data selection, conducting experiments across three settings: standard datasets, label noise datasets, and class imbalance datasets. These experiments demonstrate its effectiveness in identifying high-value and noisy data. By enabling efficient data valuation, CHG Shapley promotes trustworthy model training through a novel data-centric perspective. Our codes are available at https://github.com/caihuaiguang/CHG-Shapley-for-Data-Valuation and https://github.com/caihuaiguang/CHG-Shapley-for-Data-Selection.
翻译:暂无翻译